Chapter 10: Representing and Mining Text
N-gram sequences: useful when phrases are significant
A document (text) is composed of tokens and terms
Collection of documents = corpus
Bag of words
Treats a document as a collection of individual words
Word count = frequency
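A minimal sketch of the bag-of-words idea above, assuming simple lowercase tokenization with a regular expression. The function name bag_of_words and the sample sentence are made up for illustration; real pipelines usually do more cleanup.

```python
import re
from collections import Counter


def bag_of_words(document: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each token.

    Word order is discarded: only the tokens and their counts survive.
    """
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(tokens)


# Hypothetical sample document, not taken from the chapter.
counts = bag_of_words("Jazz is jazz, and the band plays jazz.")
print(counts.most_common(3))
```

The count for each token is its term frequency in that one document. Nothing here says yet whether the term is informative across the corpus.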
Determine importance of a word
Words have been "stemmed" (suffixes removed)
Stopwords removed: the, and, of, etc.
Term should not be too rare
Term should not be too common
Jazz musician example
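The "not too rare, not too common" rules can be sketched with document frequency and one common form of inverse document frequency, IDF(t) = 1 + log(N / n_t), where N is the number of documents and n_t the number containing t. The toy corpus, the function names, and the cutoffs min_docs and max_fraction are assumptions chosen for illustration, not values from the chapter.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus.
corpus = [
    ["jazz", "saxophone", "the"],
    ["jazz", "trumpet", "the"],
    ["rock", "guitar", "the"],
    ["jazz", "piano", "the"],
]


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def idf(term, df, n_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1 + math.log(n_docs / df[term])


def tfidf(doc, df, n_docs):
    """Weight each term's count in one document by its IDF."""
    tf = Counter(doc)
    return {term: tf[term] * idf(term, df, n_docs) for term in tf}


def keep_terms(df, n_docs, min_docs=2, max_fraction=0.8):
    """Drop terms that are too rare (few documents) or too common (most documents)."""
    return {t for t, n in df.items() if n >= min_docs and n / n_docs <= max_fraction}


df = document_frequency(corpus)
kept = keep_terms(df, len(corpus))
scores = tfidf(corpus[0], df, len(corpus))
print(kept)
print(scores)
```

On this toy corpus "the" appears in every document, so it gets the lowest possible IDF and is filtered out as too common. The single-document instrument names are filtered out as too rare.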
Text is important
It is everywhere
To interpret: must convert it into meaningful form
Have to understand customer feedback
Considered "unstructured data"
Meant for humans, not computers
Text is "dirty"
Context is important
Refers to words
Made with matrix factorization methods
Often have to make assumptions
Assume same day
Satisfied with the direction
Predict relatively large changes
Narrow the "causal radius"
Change = positive class
No change = negative class
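The labeling scheme above can be sketched as a threshold on the size of the price move, using the same-day assumption. The function name label_story, its parameters, and the 5% default threshold are hypothetical choices for illustration; the notes only say that relatively large changes are the target.

```python
def label_story(open_price: float, close_price: float, threshold: float = 0.05) -> int:
    """Return 1 (change, positive class) or 0 (no change, negative class).

    Assumes the story is paired with the same day's opening and closing price.
    The direction of the move is ignored; only its relative size matters.
    """
    relative_change = abs(close_price - open_price) / open_price
    return 1 if relative_change >= threshold else 0


# Hypothetical prices, not real data.
labels = [label_story(100.0, 106.0), label_story(100.0, 101.0), label_story(100.0, 94.0)]
print(labels)
```

Raising the threshold keeps only larger moves in the positive class, which fits the goal of predicting relatively large changes but leaves fewer positive examples to learn from.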
Most data sets require substantial work on data representation before they can be mined
Special info usually requires special processing
and sometimes special knowledge