Chapter 10: Representing and Mining Text
Why Text is Important
everywhere
medical records
consumer complaints
product inquiries
to understand people
we must read what they write
Why Text is Difficult
Unstructured data
has linguistic structure
but that structure is meant for people, not for computers
determining context
requires a lot of pre-processing
Representation
document is
a piece of text
composed of tokens/terms
a word
collection of documents
corpus
Bag of Words
treats all docs as
lots of words
straightforward and inexpensive
reduces a doc to its words
Term Frequency
how often a word appears
produces a table
for word counts
Step 1
case is normalized
no longer case sensitive
Step 2
stemming
suffixes are removed
Step 3
stopwords removed
e.g. the, and, of, on
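The three steps above can be sketched in Python. This is a minimal illustration: the stopword list and suffix-stripping rules here are simplified stand-ins for what a real pipeline would use (e.g. Porter's stemmer and a full stopword list).

```python
from collections import Counter

STOPWORDS = {"the", "and", "of", "on", "a", "in", "to"}  # tiny illustrative list

def simple_stem(word):
    # crude suffix stripping as a stand-in for a real stemmer (e.g. Porter's)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(document):
    tokens = document.lower().split()                    # Step 1: normalize case
    tokens = [simple_stem(t) for t in tokens]            # Step 2: remove suffixes
    tokens = [t for t in tokens if t not in STOPWORDS]   # Step 3: drop stopwords
    return Counter(tokens)                               # the table of word counts

print(term_frequencies("The cats and the dog played on the lawn"))
```

The result is the term-frequency table: each surviving stemmed word mapped to its count in the document.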
Measuring Sparseness
a word
shouldn't be too rare
eliminated if it appears in too few documents
shouldn't be too common
overly common words are eliminated too
measured by
inverse doc frequency
can be thought of as
boost a term gets for being rare
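One common formulation of that boost (the "1 + log" variant; other variants exist) is:

```latex
\mathrm{IDF}(t) = 1 + \log\!\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)
```

The fewer documents a term appears in, the larger the fraction inside the log, so rarer terms get a bigger weight.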
Combining Them: TFIDF
product of
term frequency (TF)
Inverse doc frequency (IDF)
term = t
document = d
doing this turns docs into feature vectors
that can go into data mining algorithms
though the representation is not necessarily optimal
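A minimal sketch of computing TFIDF over a toy corpus. It assumes raw counts for TF and the "1 + log" form of IDF; real feature-generation pipelines vary in both choices.

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: list of documents, each a list of tokens.
    Returns one {term: TFIDF score} dict per document."""
    n_docs = len(corpus)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)  # term frequency: raw count within this document
        scores.append({
            t: count * (1 + math.log(n_docs / df[t]))  # TFIDF(t, d) = TF(t, d) * IDF(t)
            for t, count in tf.items()
        })
    return scores

docs = [["jazz", "guitar", "jazz"], ["guitar", "amp"], ["jazz", "club"]]
for s in tfidf(docs):
    print(s)
```

A term that is frequent in one document but spread across few documents (like "jazz" in the first doc) scores higher than one that is common across the corpus.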
The Relationship of IDF to Entropy
entropy ≠ IDF
however they are related
when combined
they form an expected-value equation
Beyond Bag of Words
N-gram Sequences
word order is important
sequences of adjacent words
phrases are significant
even if
individual words aren't
require no linguistic knowledge
disadvantage
greatly increase size of feature set
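Adjacent-word sequences can be generated mechanically, which is part of their appeal (a sketch; n = 2 yields the bigrams of a tokenized document):

```python
def ngrams(tokens, n):
    # slide a window of length n across the token sequence
    return ["_".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["bag", "of", "words"], 2))  # → ['bag_of', 'of_words']
```

Every position contributes one n-gram, so the feature set grows quickly as n increases.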
Named Entity Extraction
sophistication in phrase extraction
ex: Silicon Valley
knows when word sequences
constitute proper names
knowledge intensive
lots of training
or hand coding
Topic Models
topic layer
a layer between the documents and the model
model topics separately
advantage
can use terms that
don't directly match
model can stay relevant
even if it isn't perfect
matrix factorization methods
look to cluster words
in theory (math is hard)