Chapter 10: Representing and Mining Text
Text is one of the more common forms of data
But it is unstructured, making it harder for a computer or a model to interpret
Text therefore requires more pre-processing steps than other forms of data
This chapter covers ways to transform text so that it can be fed into a data mining algorithm
Transforming text into a vector
Bag of words
Treat every document as if it were a collection of words; ignores grammar, context, word order, etc.
Benefits: straightforward and inexpensive to generate
Turns words into binary features: does the document contain this word? Yes or no
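A minimal sketch of the binary bag-of-words idea (the documents and words here are made-up examples, not from the chapter):

```python
# Two toy documents; the vocabulary is every distinct word across them.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Sort the vocabulary so the vector positions are stable.
vocab = sorted({word for doc in docs for word in doc.split()})

# For each document: 1 if the word appears, 0 if not (frequency is ignored).
vectors = [[1 if word in doc.split() else 0 for word in vocab] for doc in docs]
```

Each document becomes a fixed-length vector over the shared vocabulary, which is what lets a mining algorithm consume it.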
Term Frequency: Finding the frequency of each word in a document
Organizing text by word frequency
Steps for term frequency: 1) normalize case, 2) stem words so that suffixes are removed, 3) remove what are referred to as "stop words", i.e. very common words such as if, the, of
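The three steps above can be sketched as follows. The stop-word list and suffix rules here are simplified toy assumptions (a real system would use something like Porter stemming):

```python
from collections import Counter

# Toy stop-word list (real lists are much longer).
STOP_WORDS = {"if", "the", "of", "a", "and"}

def naive_stem(word):
    # Strip a few common suffixes; a crude stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequency(text):
    words = text.lower().split()                        # 1) normalize case
    words = [naive_stem(w) for w in words]              # 2) stem suffixes
    words = [w for w in words if w not in STOP_WORDS]   # 3) remove stop words
    return Counter(words)

tf = term_frequency("The jumping cats jumped over the fence")
```

Note that "jumping" and "jumped" collapse to the same stem, so they count as one term.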
Measuring sparseness: Inverse Document Frequency (IDF). Instead of measuring how often a term appears within a document, you measure how infrequently it is used across the set of documents
IDF(t) = 1 + log(total # of documents / # of documents containing t)
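The IDF formula above can be sketched directly (the document set here is a made-up example):

```python
import math

def idf(term, docs):
    # Count the documents that contain the term at least once.
    containing = sum(1 for doc in docs if term in doc.split())
    # IDF(t) = 1 + log(total docs / docs containing t)
    return 1 + math.log(len(docs) / containing)

docs = ["jazz music concert", "rock music show", "jazz festival"]
```

A term appearing in fewer documents gets a higher IDF, so rare terms are weighted as more informative.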
Term frequency and IDF can be combined (a very common approach, TFIDF)
Assigns a value to each term in a document based on both its frequency in that document and its rarity across the set of documents
Memory Trigger: Jazz musicians example
N-grams are useful when bag-of-words representations do not go far enough
Groups words into buckets of adjacent words
Adjacent pairs are called bi-grams
Enables more complex phrase recognition
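A minimal sketch of extracting n-grams; the example phrase is made up:

```python
def ngrams(text, n=2):
    # Slide a window of n adjacent words across the text.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

bigrams = ngrams("New York stock exchange")
```

The bi-gram "new york" survives as a unit, where plain bag of words would split it into two unrelated tokens.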
Named Entity Extraction
Hard-coding named entities to be extracted, e.g. Silicon Valley, CU Boulder, Seattle Mariners
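A toy sketch of matching against a hard-coded entity list (the entities come from the notes; the matching is simple substring lookup, much cruder than real entity extractors):

```python
# Hard-coded list of known entities, stored lowercase for matching.
ENTITIES = ["silicon valley", "cu boulder", "seattle mariners"]

def extract_entities(text):
    # Return every known entity that appears in the text.
    lowered = text.lower()
    return [e for e in ENTITIES if e in lowered]

found = extract_entities("The Seattle Mariners played while Silicon Valley watched")
```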
Example: mining text to predict stock price movement
Based on news stories, predicting whether a stock will go up, down, or show no change
changes must be relatively large
Per the example, the ROC and lift charts show high variability, which demonstrates the difficulty of this sort of task