Provost Ch 10

Real-world data prep

Make features match existing tools
Usually the easier approach

Create tools to match the data

Text processing

Requires dedicated preprocessing steps

requires specific expertise

Why text is difficult

"unstructured data"

Varying word lengths and numbers of words create difficulty for computers

People use poor grammar and often make spelling mistakes

Synonyms create issues when considering word meaning

Context and punctuation are key; humans pick up on them easily, but computers struggle

How to work with Text

a "document" can be any piece of text, regardless of length or structure

A "token" or "term" is one word

Bag of words

treat a document as a collection of words

Ignore grammar, word order, sentence structure, and usually punctuation

straightforward and inexpensive
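
A minimal sketch of the bag-of-words idea, not taken from the chapter. The function name bag_of_words and the tokenizing regex are assumptions made for illustration; lowercasing is included here only so that "The" and "the" count as the same word.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Return word counts for a document, ignoring order and punctuation."""
    # Lowercase, then keep runs of letters, digits and apostrophes as tokens.
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(tokens)

bag = bag_of_words("The dog chased the cat. The cat ran!")

# Word order is discarded, so these two documents get the same bag.
same = bag_of_words("cat dog") == bag_of_words("dog cat")
```

If only presence or absence of each word matters, set(bag) gives the plain collection of words without counts.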

Term frequency

Count how often each term occurs in a document; frequency is used as a signal of importance

normalize case
ex: iPhone, IPHONE, iphone all become iphone

stem words: remove affixes (mostly suffixes) to get to the root word
ex: announces, announced, announcing all become announc

remove stop words: and, the, of, etc.
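
The three preprocessing steps above (case normalization, stemming, stop-word removal) can be sketched roughly as follows. Everything named here is hypothetical: STOP_WORDS is a tiny illustrative list, and crude_stem is a toy suffix stripper standing in for a real stemmer, so it will mis-stem many words.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"and", "the", "of", "a", "an", "to", "in", "is"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common suffix. A toy stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(document):
    """Lowercase, drop stop words, stem, then count the remaining terms."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    terms = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(terms)

tf = term_frequencies("The iPhone and the IPHONE: announcing, announced, announces")
```

After these steps the two spellings of iPhone are counted as one term, and the three verb forms collapse to a single stem.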

IDF(t) = 1 + log(total # of documents / # of documents containing t)

Gives greater weight to terms that are rare across the corpus (appear in few documents)
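
A small worked sketch of the IDF formula above. The notes do not say which logarithm base is used, so the natural log is assumed here; the function name idf and the four-document corpus are made up for illustration, with each document represented as a set of its terms.

```python
import math

def idf(term, documents):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        # The formula is undefined for a term that appears nowhere.
        raise ValueError(f"term {term!r} appears in no document")
    return 1 + math.log(len(documents) / containing)

# Hypothetical corpus: each document is the set of terms it contains.
corpus = [
    {"jazz", "music", "piano"},
    {"music", "guitar"},
    {"music", "concert"},
    {"music", "review"},
]

idf_common = idf("music", corpus)  # appears in all 4 documents
idf_rare = idf("jazz", corpus)     # appears in 1 of 4 documents
```

A term found in every document gets the minimum value of 1 (since log(1) = 0), while a term found in only one document scores higher, which is the sense in which rare terms get more weight.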