Chapter 10: Representing and Mining Text
N-gram Sequences
Useful when phrases are significant
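A minimal sketch of n-gram extraction, assuming tokens come from a simple whitespace split. The helper name `ngrams` and the sample sentence are illustrative, not from the chapter.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Illustrative sentence (made up for this sketch).
tokens = "the quick brown fox jumps".split()

# Bigrams keep adjacent word pairs, so a phrase such as "brown fox"
# survives as a single feature.
bigrams = ngrams(tokens, 2)

# A common choice is to use single words and bigrams together as features.
features = tokens + bigrams
```

Including n-grams grows the feature set quickly, so they are usually worth it only when phrases carry meaning that individual words lose.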
Entity extractor
Representation
A document (text) is composed of tokens and terms
Collection of documents = corpus
Bag of words
Treats a document as a collection of individual words
Word count = frequency
Determine importance of a word
Normalize words
Words have been "Stemmed"
Suffixes removed
Remove stopwords
the, and, of, etc.
Term should not be too rare
Term should not be too common
Jazz musician example
TFIDF
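The representation steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the stopword list is a tiny made-up sample, the three documents are invented, stemming is left out, and IDF uses one common form, 1 + log(N / number of documents containing the term). Other variants exist.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "and", "of", "a", "in", "is"}

def bag_of_words(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords, count terms."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOPWORDS)

def idf(term, bags):
    """Inverse document frequency: 1 + log(N / docs containing the term).

    Assumes the term occurs in at least one document of the corpus.
    """
    containing = sum(1 for bag in bags if term in bag)
    return 1.0 + math.log(len(bags) / containing)

def tfidf(bag, bags):
    """Weight each term's raw count (TF) by its IDF across the corpus."""
    return {term: count * idf(term, bags) for term, count in bag.items()}

# Invented three-document corpus for the sketch.
corpus = [
    "The jazz musician played the saxophone.",
    "The musician and the band played jazz.",
    "The price of the stock is stable.",
]
bags = [bag_of_words(doc) for doc in corpus]
scores = tfidf(bags[0], bags)
```

In this toy corpus "saxophone" appears in only one document, so it scores higher than "jazz", which appears in two. That is the intended effect: terms that are rarer across the corpus are weighted up.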
Text is important
It is everywhere
To interpret: must convert it into meaningful form
Old media
New media
Have to understand customer feedback
Considered "unstructured data"
Linguistic structure
Meant for humans, not computers
Text is "dirty"
Context is important
Positive
Negative
Topic models
Efficient
Refers to words
Made with matrix factorization methods
Topic layers
Often have to make assumptions
Example
assume same day
Satisfied with predicting the direction of change
Change
No change
Predict relatively large changes
Narrow the "causal radius"
Surges
Plunges
Stable
Data processing:
Results
Change = positive class
No change = negative class
Determine significance
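The labeling scheme above can be sketched as follows. The numeric thresholds (`big=5.0`, `small=2.5`, in percent) and the function names are assumptions chosen for illustration. The notes only say that relatively large changes are predicted and that change is the positive class.

```python
def label_change(pct_change, big=5.0, small=2.5):
    """Label a same-day percentage price move.

    Thresholds are hypothetical: moves of at least `big` percent are
    surges or plunges, moves within `small` percent are stable, and
    anything in between is left unlabeled (None).
    """
    if pct_change >= big:
        return "surge"
    if pct_change <= -big:
        return "plunge"
    if abs(pct_change) <= small:
        return "stable"
    return None  # ambiguous move, excluded from training

def to_class(label):
    """Change (surge or plunge) = positive class (1); stable = negative class (0)."""
    if label is None:
        return None
    return 1 if label in ("surge", "plunge") else 0

# Made-up same-day percentage changes.
moves = [6.0, -7.2, 1.0, 3.0]
classes = [to_class(label_change(m)) for m in moves]
```

Dropping the in-between moves is one way to focus the model on relatively large changes. Whether that trade-off is appropriate depends on the data.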
Most data sets require substantial representation work before they are suitable for mining
Special info usually requires special processing
and sometimes special knowledge