CHAPTER 10: REPRESENTING & MINING TEXT
Why Text Is Important
text is everywhere
medical records, customer complaints, product inquiries, and repair records
in business, understanding customer feedback often requires understanding text
Why Text Is Difficult
often referred to as "unstructured data"
text does not have the sort of structure that we normally expect from data
as data, text is relatively "dirty"
often ungrammatical: misspelled words, words run together, unpredictable abbreviations, and random punctuation
even when text is flawlessly expressed, it may contain synonyms (multiple words with the same meaning)
Representation
transform a body of text into a set of data that can be fed into a data mining algorithm
collection of documents - corpus
Bag of Words
treat every document as a collection of individual words, turning a set of documents into the familiar feature-vector form
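A minimal sketch of the bag-of-words idea in Python, assuming a simple regex tokenizer and a binary (present or absent) encoding. The function names `tokenize` and `bag_of_words` and the two sample documents are hypothetical, chosen only for illustration.

```python
import re

def tokenize(text):
    # Lowercase the text and keep runs of letters or digits as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(corpus):
    # Vocabulary: every distinct token in the corpus, in sorted order.
    vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})
    vectors = []
    for doc in corpus:
        present = set(tokenize(doc))
        # One feature per vocabulary word: 1 if present, 0 otherwise.
        vectors.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, vectors

# Hypothetical two-document corpus.
corpus = [
    "The printer is broken",
    "Printer ink is expensive",
]
vocabulary, vectors = bag_of_words(corpus)
```

Each document becomes one row with one column per vocabulary word, so word order and grammar are discarded.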
Term Frequency
use the word count (frequency) in the document instead of just one or zero
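As a rough illustration, term frequency can be sketched with the standard library's `collections.Counter`. The helper name `term_frequencies` and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter

def term_frequencies(document):
    # Lowercase, tokenize, and count how often each word occurs.
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    return Counter(tokens)

# Hypothetical customer comment.
tf = term_frequencies("Great phone, great battery, poor screen")
```

Here `tf["great"]` is 2 rather than 1, so repeated words carry more weight than in the one-or-zero encoding.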
Measuring Sparseness: Inverse Document Frequency
how prevalent a term is in a single document
how common it is in the entire corpus we're mining
a term should not be too rare
Topic Models
a term should not be too common
a term that occurs in nearly every document doesn't distinguish anything
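The sketch below combines the two measures, assuming one common form of inverse document frequency, IDF(t) = 1 + log(N / n_t), where N is the number of documents and n_t is the number containing term t. TFIDF is then the term's count in a document multiplied by its IDF. The function names and the three-document corpus are hypothetical, and other IDF variants exist.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def inverse_document_frequency(corpus):
    # Count, for each term, how many documents contain it at least once.
    n_docs = len(corpus)
    doc_counts = Counter()
    for doc in corpus:
        doc_counts.update(set(tokenize(doc)))
    # Assumed form: IDF(t) = 1 + log(N / n_t).
    return {term: 1 + math.log(n_docs / count)
            for term, count in doc_counts.items()}

def tfidf(document, idf):
    # Weight each term's count in the document by its corpus-level IDF.
    tf = Counter(tokenize(document))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

# Hypothetical corpus of three short documents.
corpus = [
    "the battery died",
    "the screen cracked",
    "the battery is great",
]
idf = inverse_document_frequency(corpus)
weights = tfidf(corpus[0], idf)
```

A term found in every document ("the") gets the lowest possible IDF, while a term found in only one document ("died") gets the highest. The "not too rare" side is usually handled separately, for example by dropping terms that appear in fewer than some minimum number of documents.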
Named entity extraction
Data Preprocessing
we have two streams of data
predict stories that produce a substantial change
reduce each story to a TFIDF representation
each word is case-normalized and stemmed, and stop words are removed
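The preprocessing steps above can be sketched as follows. This is an illustration under loud assumptions: `STOP_WORDS` is a tiny made-up list, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (such as Porter's), not the method used in the chapter. The sample story is also hypothetical.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "its"}

def naive_stem(word):
    # Crude suffix stripping for illustration only; not a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(story):
    # Case-normalize, tokenize, drop stop words, then stem what remains.
    tokens = re.findall(r"[a-z]+", story.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

# Hypothetical news story text.
tokens = preprocess("The company announced rising profits and the shares jumped")
```

The stems need not be dictionary words ("announc", "ris"); what matters is that variants of the same word collapse to one token before the TFIDF weights are computed.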
TRACY GIANG