Provost Ch 10
Real-world data prep
Make features match existing tools
Easier
Create tools to match the data
Text processing
Requires dedicated preprocessing steps
Requires specific expertise
Why text is difficult
"unstructured data"
Varying word lengths and numbers of words create difficulty for computers
People use poor grammar and often make spelling mistakes
Synonyms create issues when considering word meaning
Context and punctuation are key; humans pick up on them, but computers struggle
How to work with Text
a "document" can be any piece of text, regardless of length or structure
A "token" or "term" is one word
Bag of words
treat a document as a collection of words
Ignore grammar, word order, sentence structure, and usually punctuation
Straightforward and inexpensive
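A minimal sketch of the bag-of-words idea, assuming a simple regex tokenizer (the function name bag_of_words and the sample sentence are illustrative, not from the chapter). It keeps only word counts and throws away order and punctuation:

```python
import re
from collections import Counter

def bag_of_words(document: str) -> Counter:
    # \w+ matches runs of letters, digits, and underscores, so punctuation is dropped.
    # Word order is lost because only the counts are kept.
    tokens = re.findall(r"\w+", document)
    return Counter(tokens)

bag = bag_of_words("The cat sat. The cat ran!")
print(bag)
```

Case is left untouched here on purpose; normalizing it is a separate step under term frequency.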
Term frequency
Analyze how frequency relates to importance
normalize words
ex: iPhone, IPHONE, iphone
stem words: remove prefixes and suffixes to get to the root word
remove stop words: and, the, of, etc.
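The three preprocessing steps above can be sketched together. This is a rough illustration under stated assumptions: STOP_WORDS is a tiny made-up list, and crude_stem is a toy suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re
from collections import Counter

# Illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"and", "the", "of", "a", "is"}

def crude_stem(token: str) -> str:
    # Toy stemmer: strip one common suffix if at least 3 characters remain.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def term_frequencies(document: str) -> Counter:
    # Normalize case so iPhone, IPHONE, and iphone count as the same term.
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    # Drop stop words, then stem what is left.
    kept = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(kept)

tf = term_frequencies("The iPhone and the IPHONE: reviews of iphones")
print(tf)
```

In this example all three spellings, plus the plural, collapse to the single term "iphone", and the stop words never reach the counts.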
IDF(t) = 1 + log(total # of documents / # of documents containing t)
Gives greater weight to terms that are rare across the document collection
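A small sketch of the IDF formula, assuming the natural logarithm (the notes do not state a base) and representing each document as a set of terms. The function name idf and the four sample documents are made up for illustration.

```python
import math
from typing import List, Set

def idf(term: str, documents: List[Set[str]]) -> float:
    # Number of documents that contain the term at least once.
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} appears in no document")
    # IDF(t) = 1 + log(total documents / documents containing t), natural log assumed.
    return 1 + math.log(len(documents) / containing)

docs = [
    {"iphone", "battery"},
    {"iphone", "screen"},
    {"iphone", "battery", "case"},
    {"iphone"},
]

idf_common = idf("iphone", docs)  # appears in every document
idf_rare = idf("case", docs)      # appears in one document
print(idf_common, idf_rare)
```

A term found in every document gets the minimum value of 1, while a term found in only one of the four documents scores higher, which is the weighting the formula is meant to produce.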