Chapter 10: Representing and Mining Text
Problem: data does not come in a format friendly for mining
Examining text data
very common form of data
often referred to as "unstructured data"
often represented in many different forms
feature selection is often employed with text representation
TFIDF is a very common value representation for terms
e.g., TFIDF identifying "latin" as an important word in documents about jazz music
Vocab
"document": one piece of text, no matter how large or small
composed of tokens or terms
"corpus": a collection of documents
"term frequency" (TF): the count of a term's occurrences in a document
"bag of words" approach: treat each document as collection of individual words
"normalized": every term is lowercase; "stemmed": suffixes have been removed (e.g., "jumping" → "jump"); "stop words": very common words (the, and, of, on) are removed
"inverse document frequency" determining how important a word actually is to a document
IDF(t) = 1 + log(total number of documents / number of documents containing t)
"TFIDF": product of term frequency and inverse document frequency: TFIDF(t, d) = TF(t, d) * IDF(t)
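The TF, IDF, and TFIDF definitions above can be sketched in a few lines of Python. The three-document corpus here is hypothetical, chosen only to make the arithmetic visible:

```python
import math

# Hypothetical mini-corpus for illustration (not from the chapter).
corpus = [
    "latin jazz music",
    "jazz music history",
    "latin dance music",
]
docs = [d.split() for d in corpus]  # bag-of-words tokenization
N = len(docs)

def tf(term, doc):
    # Term frequency: count of the term's occurrences in the document.
    return doc.count(term)

def idf(term):
    # 1 + log(total documents / documents containing the term).
    n_containing = sum(1 for doc in docs if term in doc)
    return 1 + math.log(N / n_containing)

def tfidf(term, doc):
    # Product of term frequency and inverse document frequency.
    return tf(term, doc) * idf(term)

# "music" appears in every document, so its IDF is the minimum (1);
# "latin" appears in only 2 of 3 documents, so it scores higher.
print(tfidf("music", docs[0]))
print(tfidf("latin", docs[0]))
```

Note how a term that appears in every document gets IDF = 1 + log(1) = 1, the minimum weight, matching the intuition that ubiquitous terms are uninformative.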
"N Gram Sequences": includes sequences of frequent adjacent words
increase the size of the feature set
this can get out of hand
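The n-gram idea above is a short sliding window over the token list. A minimal sketch (tokens and sentence are illustrative):

```python
def ngrams(tokens, n):
    # All sequences of n adjacent tokens, joined into single features.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))  # ['the quick', 'quick brown', 'brown fox']

# Combining unigrams and bigrams shows how the feature set grows:
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(len(features))  # 4 unigrams + 3 bigrams = 7
```

With longer documents and larger n, the number of distinct n-grams grows rapidly, which is the "out of hand" problem noted above.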
"Named Entity" extraction: sophisticated phrase extraction; knowledge intensive
"topic models": learn an intermediate "topic" layer between documents and words, so documents relate to words indirectly through topics
IDF to Entropy
estimate the probability p of a term t occurring in a document (the fraction of documents containing t)
simplify this by using the log
entropy = -p1 log(p1) - p2 log(p2), where p1 = p and p2 = 1 - p
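A small sketch of the two-outcome entropy formula above, assuming log base 2 (a common convention; the notes do not specify a base):

```python
import math

def entropy(p1):
    # Two-outcome entropy: -p1 log2(p1) - p2 log2(p2), with p2 = 1 - p1.
    # Terms with probability 0 contribute 0 (the limit of p*log(p)).
    p2 = 1 - p1
    return sum(-p * math.log2(p) for p in (p1, p2) if p > 0)

# A term occurring in half the documents is maximally uncertain (entropy 1);
# a term in all documents carries no information (entropy 0).
print(entropy(0.5))  # 1.0
print(entropy(1.0))  # 0.0
```

This mirrors the IDF intuition: a term found in every document (p = 1) has zero entropy and minimal IDF, while a rarer term is more informative.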
another example is mining news stories to predict stock prices