Provost Ch10: Representing and Mining Text
Why Text is Important
text is everywhere
must be converted to meaningful form
In business, understanding customer feedback often requires understanding text
Why Text is Difficult
“unstructured” data
text does not have the sort of structure that we normally expect for data
linguistic structure
Representation
The general strategy in text mining is to use the simplest (least expensive) technique that works
document = one piece of text, no matter how large or small
document is composed of individual tokens or terms
corpus = collection of documents
Bag of Words
to treat every document as just a collection of individual words
ignores grammar, word order, sentence structure, and (usually) punctuation
treats every word in a document as a potentially important keyword of the document
straightforward and inexpensive to generate, and tends to work well for many tasks.
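A minimal sketch of the bag-of-words idea, assuming a simple regex tokenizer and a two-document toy corpus (both are illustrative choices, not from the chapter). Each document becomes a set of words, then a zero/one vector over the corpus vocabulary:

```python
import re

def tokenize(document):
    # Assumed tokenizer: lowercase, keep runs of letters/digits/apostrophes,
    # which drops punctuation.
    return re.findall(r"[a-z0-9']+", document.lower())

def bag_of_words(document):
    # A set of distinct words: grammar and word order are discarded.
    return set(tokenize(document))

def binary_vector(document, vocabulary):
    # 1 if the vocabulary term occurs in the document, otherwise 0.
    bag = bag_of_words(document)
    return [1 if term in bag else 0 for term in vocabulary]

# Toy corpus, made up for illustration.
corpus = [
    "The service was great, really great!",
    "Great food but slow service.",
]
vocabulary = sorted(set().union(*(bag_of_words(d) for d in corpus)))
vectors = [binary_vector(d, vocabulary) for d in corpus]
```

Note that "great" appears twice in the first document but still contributes a single 1; the zero/one form records presence only.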
Term Frequency
use the word count (frequency) in the document instead of just a zero or one
differentiate between how many times a word is used
term frequency representation = importance of a term in a document should increase with the number of times that term occurs
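Term frequency can be sketched by counting tokens instead of recording presence. The tokenizer and the example sentence below are assumptions for illustration:

```python
import re
from collections import Counter

def term_frequencies(document):
    # Assumed tokenizer: lowercase, keep runs of letters/digits/apostrophes.
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    # Counter maps each term to the number of times it occurs.
    return Counter(tokens)

tf = term_frequencies("The service was great, really great!")
```

Here `tf["great"]` is 2 and `tf["service"]` is 1, so the representation distinguishes a word used twice from a word used once.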
Measuring Sparseness: Inverse Document Frequency
term frequency measures how prevalent a term is in a single document
when deciding the weight of a term, we may also care how common it is in the entire corpus we’re mining
two opposing considerations
a term should not be too rare
a term should not be too common
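A sketch of both considerations, under stated assumptions: the corpus is a toy example, the bounds `min_df` and `max_df_fraction` are hypothetical thresholds, and IDF is computed with one common formulation, 1 + log(total documents / documents containing the term).

```python
import math

def document_frequency(tokenized_corpus):
    # Number of documents in which each term appears at least once.
    df = {}
    for tokens in tokenized_corpus:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return df

def idf(term, df, n_docs):
    # Assumed formulation: 1 + log(N / df). Rarer terms get a larger boost.
    return 1 + math.log(n_docs / df[term])

# Toy corpus, already tokenized (made up for illustration).
tokenized_corpus = [
    ["great", "service"],
    ["great", "food"],
    ["great", "slow", "service"],
    ["great", "price"],
]
df = document_frequency(tokenized_corpus)
n_docs = len(tokenized_corpus)

# Hypothetical bounds: drop terms in fewer than 2 documents (too rare)
# or in more than 90% of documents (too common).
min_df, max_df_fraction = 2, 0.9
kept = sorted(
    t for t, count in df.items()
    if count >= min_df and count / n_docs <= max_df_fraction
)
```

With these bounds only "service" survives: "great" occurs in every document, and "food", "slow", and "price" each occur in just one. A term found in every document gets the minimum IDF of 1.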
Combining Them: TFIDF
the product of Term Frequency (TF) and Inverse Document Frequency (IDF)
TFIDF(t, d) = TF(t, d) × IDF(t)
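The product can be sketched as follows, assuming raw counts for TF and the formulation 1 + log(N / df) for IDF. The function name `tfidf` and the two toy documents are illustrative, not from the chapter:

```python
import math
from collections import Counter

def tfidf(tokenized_corpus):
    n_docs = len(tokenized_corpus)
    # Document frequency: count each term once per document.
    df = Counter(term for tokens in tokenized_corpus for term in set(tokens))
    # Assumed IDF formulation: 1 + log(N / df).
    idf = {term: 1 + math.log(n_docs / count) for term, count in df.items()}
    # TFIDF(t, d) = TF(t, d) * IDF(t), one dict of weights per document.
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized_corpus
    ]

# Toy corpus, already tokenized (made up for illustration).
docs = [
    ["jazz", "jazz", "music"],
    ["rock", "music"],
]
weights = tfidf(docs)
```

In the first document "jazz" outweighs "music": it occurs twice there and appears in only one of the two documents, while "music" appears in both. The TFIDF value is specific to a single document, whereas IDF depends on the whole corpus.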
James Frainey
Jafr4672@colorado.edu