Provost Ch. 10
TEXT
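Text matters because so much communication, especially online, happens in text
Difficult to work with because it lacks the structure of a typical data table
As data, text is relatively dirty: misspellings, abbreviations, inconsistent grammar
Important to convert text into a data representation a model can use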
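BAG OF WORDS: treat every document as a collection of individual words, ignoring order and grammar
TERM FREQUENCY (TF): number of times a word appears in a document
INVERSE DOCUMENT FREQUENCY (IDF): weights a term more heavily the rarer it is across documents
The two can be combined as TF-IDF (TF multiplied by IDF)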
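A minimal sketch of the three ideas above, in plain Python. The sample documents and function names are made up for illustration, and the IDF form used here (1 + log of total documents over documents containing the term) is one common choice, not the only one.

```python
import math
from collections import Counter

def bag_of_words(document):
    """Lowercase the text and split on whitespace; word order is discarded."""
    return Counter(document.lower().split())

def term_frequency(bag):
    """Word counts divided by document length, so long documents do not dominate."""
    total = sum(bag.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in bag.items()}

def inverse_document_frequency(bags):
    """IDF(t) = 1 + log(N / number of documents containing t)."""
    n_docs = len(bags)
    # Iterating over a Counter yields each distinct term once per document.
    doc_counts = Counter(term for bag in bags for term in bag)
    return {term: 1 + math.log(n_docs / df) for term, df in doc_counts.items()}

def tfidf(documents):
    """Return one {term: TF * IDF} dictionary per document."""
    bags = [bag_of_words(d) for d in documents]
    idf = inverse_document_frequency(bags)
    return [
        {term: tf * idf[term] for term, tf in term_frequency(bag).items()}
        for bag in bags
    ]

# Hypothetical mini-corpus.
docs = [
    "stocks rise as markets rally",
    "markets fall as stocks slide",
    "jazz festival opens downtown",
]
scores = tfidf(docs)
```

In the first document, "rise" scores higher than "stocks": both appear once, but "rise" occurs in only one document, so its IDF weight is larger.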
THE DATA
Preprocess the data with attention to small details
Check the timestamps on news stories
With numerical data, make sure values line up day by day
Some numeric data is mixed in with the words
In the case of stock prices:
use the daily open and close prices
subtract open from close to get the day's gain or loss
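A small sketch of the stock-price step, assuming each story should be matched to the trading day of its timestamp. All prices, dates, and headlines below are made up for illustration, and the helper names are hypothetical.

```python
from datetime import date, datetime

# Made-up illustrative records: (trading day, open price, close price).
prices = [
    (date(2024, 1, 2), 100.0, 103.5),
    (date(2024, 1, 3), 103.5, 101.0),
]

# Made-up illustrative stories: (timestamp, headline).
stories = [
    (datetime(2024, 1, 2, 9, 45), "Company announces new product"),
    (datetime(2024, 1, 3, 14, 10), "Regulator opens inquiry"),
]

def daily_change(open_price, close_price):
    """Close minus open: positive is a gain, negative is a loss."""
    return close_price - open_price

def label(change):
    """Turn a numeric change into a class label for the story."""
    if change > 0:
        return "gain"
    if change < 0:
        return "loss"
    return "flat"

change_by_day = {day: daily_change(o, c) for day, o, c in prices}

# Align each story with the trading day of its timestamp;
# stories on days without price data are skipped.
labeled_stories = [
    (headline, label(change_by_day[ts.date()]))
    for ts, headline in stories
    if ts.date() in change_by_day
]
```

Matching on the calendar date is the simplest choice; stories published after the market closes may belong with the next trading day, which is why the timestamps are worth checking.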
TOPIC MODELS
A topic layer is an additional layer between the documents and the model
The topics present in the corpus are found first
Words are then associated with each topic, and each document is described by its topics
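A rough sketch of what the intermediate topic layer looks like. This is not a real topic model: the topic word lists below are hypothetical and written by hand, whereas an actual topic model learns these word groupings from the corpus. The point is only to show a document being described by topics rather than by raw words.

```python
from collections import Counter

# Hypothetical, hand-written topic vocabularies. A real topic model would
# learn these groupings from the documents instead.
TOPIC_WORDS = {
    "finance": {"stocks", "markets", "shares", "earnings"},
    "music": {"jazz", "festival", "concert", "band"},
}

def topic_proportions(document):
    """Share of the document's topic-related words that fall in each topic."""
    words = document.lower().split()
    hits = Counter()
    for word in words:
        for topic, vocab in TOPIC_WORDS.items():
            if word in vocab:
                hits[topic] += 1
    total = sum(hits.values())
    return {
        topic: (hits[topic] / total if total else 0.0)
        for topic in TOPIC_WORDS
    }

# The model would see these topic proportions instead of the raw words.
features = topic_proportions("Jazz festival lifts local stocks")
```

Here the example document has two music words and one finance word, so it is described as about two-thirds music and one-third finance.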
Riley Hillis