Provost Ch. 10

TEXT

Very important because so much communication, especially online, happens as text

Difficult because text is unstructured: no fixed fields like the columns of a table

As data, text is relatively dirty: misspellings, abbreviations, inconsistent punctuation

Important to convert text into a form that data mining methods can use

BAG OF WORDS: treat every document as a collection of individual words, ignoring grammar and word order

TERM FREQUENCY (TF): number of times a word appears in a document

INVERSE DOCUMENT FREQUENCY (IDF): gives a term more weight the fewer documents it appears in

Can combine the two as TFIDF: multiply TF by IDF
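A minimal sketch of how bag of words, TF, and IDF fit together, using only the Python standard library. The toy corpus and the function names (bag_of_words, idf, tfidf) are made up for illustration. The IDF formula used here, 1 + log(N / number of documents containing the term), is one common variant and is an assumption, not necessarily the exact form used in the chapter.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration.
docs = [
    "the stock price rose today",
    "the stock price fell today",
    "the weather is sunny",
]

def bag_of_words(text):
    # Lowercase and split on whitespace; word order and grammar are discarded.
    return Counter(text.lower().split())

bags = [bag_of_words(d) for d in docs]

def idf(term):
    # Assumed variant: 1 + log(N / number of documents containing the term).
    # Only call this for terms that appear in at least one document.
    n_containing = sum(1 for bag in bags if term in bag)
    return 1 + math.log(len(bags) / n_containing)

def tfidf(bag):
    # Raw term count (TF) multiplied by IDF for each word in the document.
    return {term: count * idf(term) for term, count in bag.items()}

weights = [tfidf(bag) for bag in bags]
```

In the first document, "the" appears in every document and gets the lowest weight, "stock" appears in two of three and lands in the middle, and "rose" appears in only one and gets the highest weight.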


THE DATA

Preprocess the data carefully; small details matter

Check the timestamps on the stories

With numerical data, note the granularity: it is recorded day by day

Some data is mixed in with the words of the text

In the case of stock prices

Use the opening and closing prices

Subtract open from close to see the day's gain or loss
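The open/close subtraction can be sketched as follows. The dates, prices, and the daily_change helper are all hypothetical, invented only to show the arithmetic.

```python
# Hypothetical daily prices, invented for illustration.
daily_prices = [
    {"date": "2024-01-02", "open": 100.0, "close": 103.5},
    {"date": "2024-01-03", "open": 103.0, "close": 101.0},
]

def daily_change(row):
    # Close minus open: positive is a gain for the day, negative is a loss.
    return row["close"] - row["open"]

# Key each day's change by date so it can be matched to stories by timestamp.
changes = {row["date"]: daily_change(row) for row in daily_prices}
```

Keying the result by date is a design choice for this sketch: it makes it easy to line up each day's change with stories whose timestamps fall on that day.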

TOPIC MODELS

Add a layer of topics between the document and the model

First find the different topics that occur in the documents

Each topic is then tied to a set of words that characterize it
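A toy illustration of the topic layer idea: documents are described by topics rather than by raw words. The topic names and word lists here are hand-picked and hypothetical. A real topic model would learn the topics and their words from the corpus, so this only shows the shape of the intermediate layer, not how topics are discovered.

```python
# Hypothetical, hand-picked topics; a real topic model would learn these.
topic_words = {
    "finance": {"stock", "price", "market", "shares"},
    "weather": {"rain", "sunny", "storm", "forecast"},
}

def topic_features(text):
    # Map a document's words onto topics by counting matches per topic.
    words = text.lower().split()
    return {
        topic: sum(1 for w in words if w in vocab)
        for topic, vocab in topic_words.items()
    }

features = topic_features("Stock price fell as the storm hit the market")
```

A downstream model would then take these topic counts as input instead of the individual words.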

Riley Hillis