Provost Ch. 10: Representing and Mining Text
Why Text is Important
the Internet contains a lot of text
social media posts
emails
blog comments
Google searches
understand customer feedback
product reviews
Why Text is Difficult
"unstructured" data
text is dirty
misspelled words
unpredictable abbreviations
grammatical errors
synonyms
Representation
general strategy
use simplest (least expensive) method that works
document
one piece of text, no matter how large or small
composed of individual tokens or terms
a collection of documents is called a corpus
Bag of Words
treat every document as a collection of individual words
ignores grammar, word order, sentence structure, and punctuation
Term Frequency
use the count of each word in the document
steps for putting words in a table
1) normalize words (e.g., convert to lowercase)
2) stem words (remove suffixes)
3) remove stopwords
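The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer (such as Porter's), not the method the chapter uses.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it"}

# Suffixes tried in order by the crude stemmer below (hypothetical, not Porter).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(document):
    """Return a Counter mapping each term to its count in one document."""
    # 1) normalize: lowercase and keep only runs of letters
    tokens = re.findall(r"[a-z]+", document.lower())
    # 2) stem: remove suffixes
    stems = [crude_stem(token) for token in tokens]
    # 3) remove stopwords
    terms = [stem for stem in stems if stem not in STOPWORDS]
    return Counter(terms)
```

Calling `term_frequencies` on each document in a corpus gives one row of the term-count table per document.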
Measuring Sparseness
terms should not be too rare or too common
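One way to apply this is to count how many documents each term appears in and keep only terms inside a chosen band. The sketch below assumes each document is already a list of terms; the function name and the `min_docs` and `max_fraction` thresholds are hypothetical defaults, not values from the chapter.

```python
from collections import Counter


def filter_terms_by_document_frequency(documents, min_docs=2, max_fraction=0.5):
    """Return the set of terms that are neither too rare nor too common.

    documents: list of documents, each a list of terms.
    min_docs: a term must appear in at least this many documents.
    max_fraction: a term may appear in at most this fraction of documents.
    """
    doc_freq = Counter()
    for terms in documents:
        # set() so a term counts once per document, however often it repeats
        doc_freq.update(set(terms))
    upper_limit = max_fraction * len(documents)
    return {
        term
        for term, count in doc_freq.items()
        if min_docs <= count <= upper_limit
    }
```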
Inverse Document Frequency
IDF(t) = 1 + log(total number of documents / number of documents containing t)
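The IDF formula can be computed directly from a corpus. A minimal sketch, assuming each document is a list of terms and using the natural logarithm (the notes do not state the base); the function name is hypothetical.

```python
import math


def inverse_document_frequency(term, documents):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    containing = sum(1 for terms in documents if term in terms)
    if containing == 0:
        # The ratio is undefined for a term that appears nowhere in the corpus.
        raise ValueError(f"term {term!r} does not occur in any document")
    return 1 + math.log(len(documents) / containing)
```

A term that appears in every document gets an IDF of exactly 1, and the value grows as the term becomes rarer.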
Beyond Bag of Words
N-Gram Sequences
useful when a phrase is significant even though its component words may not be
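As a rough illustration, n-grams can be generated by sliding a window of n adjacent tokens over a document. The sketch assumes the document is already tokenized; the function names and the underscore joiner are illustrative choices.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, max_n=2):
    """Return single words plus all n-grams up to max_n as one list of terms."""
    terms = []
    for n in range(1, max_n + 1):
        terms.extend(ngrams(tokens, n))
    return terms
```

The resulting terms can be counted like ordinary words, at the cost of a much larger vocabulary.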
Named Entity Extraction
knowledge-intensive