Chapter 10: Representing and Mining Text
Why Text is Important
Understanding customer feedback often requires understanding text
Why is Text Difficult
Text is often unstructured data
Text does not have the sort of structure that we normally expect for data
Linguistic Structure
Representation
Document
One piece of text, no matter how large or small
Tokens or Terms
A word
Corpus
Collection of documents
Bag of Words
Treat every document as just a collection of individual words. This approach ignores grammar, word order, sentence structure, and (usually) punctuation.
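As a minimal sketch of the bag-of-words idea, the snippet below turns each document into a set of words and then into a row of zeros and ones over a shared vocabulary. The two-sentence corpus and the names `bag_of_words`, `vocabulary`, and `binary_rows` are illustrative assumptions, and tokenizing by whitespace is a simplification.

```python
def bag_of_words(document):
    """Return the set of distinct lowercase tokens in a document.

    Splitting on whitespace is a simplifying assumption; word order
    and grammar are discarded, which is the point of the representation.
    """
    return set(document.lower().split())


# Hypothetical two-document corpus, used only for illustration.
corpus = [
    "the service was great",
    "great food but the service was slow",
]

# Shared vocabulary: every distinct term seen anywhere in the corpus.
vocabulary = sorted(set().union(*(bag_of_words(d) for d in corpus)))

# One row per document: 1 if the term is present, 0 otherwise.
binary_rows = [
    [1 if term in bag_of_words(d) else 0 for term in vocabulary]
    for d in corpus
]

for row in binary_rows:
    print(row)
```

Each row has one column per vocabulary term, so documents of different lengths become vectors of the same length.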
Term Frequency
Word count in the document instead of just zero or one
This allows us to differentiate between how many times a word is used
The importance of a term in a document should increase with the number of times that term occurs
Each sentence is considered a separate document
Usually some basic processing (normalizing case, stemming, removing stopwords) is performed on the words before putting them in the table
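A rough sketch of term frequency with some basic processing might look like the following. The regular expression used for tokenizing and the tiny `STOPWORDS` set are assumptions made for illustration; stemming is left out because it would need an external library.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to"}


def term_frequencies(document):
    """Count how often each term occurs in one document.

    Basic processing applied first: lowercase the text, keep only
    runs of letters, digits, and apostrophes, and drop stopwords.
    """
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


tf = term_frequencies("The food was great, and the service was GREAT.")
print(tf)
```

Because case is normalized before counting, "great" and "GREAT" are counted as the same term.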
Measuring Sparseness: Inverse Document Frequency
A term should not be too rare
A term should not be too common
IDF(t) = 1 + log(Total number of documents / Number of documents containing t)
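The IDF formula can be sketched in a few lines. The outline does not say which logarithm base is intended, so the natural logarithm here is an assumption, as are the function name `idf` and the four-document corpus.

```python
import math


def idf(term, corpus):
    """IDF(t) = 1 + log(total documents / documents containing t).

    `corpus` is a list of token sets. The natural log is an assumption;
    a different base only rescales the log part.
    """
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(corpus) / containing)


# Hypothetical corpus, each document reduced to its set of tokens.
corpus = [
    set(text.split())
    for text in [
        "great food great service",
        "slow service",
        "great coffee",
        "service with a smile",
    ]
]

for term in ("coffee", "great", "service"):
    print(term, round(idf(term, corpus), 3))
```

The rarer a term is across documents, the higher its IDF; a term that appears in every document gets the minimum value of 1.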
Combining Them: TFIDF
Term Frequency and Inverse Document Frequency
TFIDF(t, d) = TF(t, d) x IDF(t)
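Putting the two pieces together, a minimal sketch of TFIDF could look like this. It assumes raw counts for TF and the natural logarithm for IDF; the name `tfidf` and the three-document corpus are illustrative only.

```python
import math
from collections import Counter


def tfidf(term, document, corpus):
    """TFIDF(t, d) = TF(t, d) x IDF(t).

    `document` is a list of tokens and `corpus` a list of such lists.
    TF is the raw count (an assumption; normalized variants exist).
    """
    tf = Counter(document)[term]
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # the term never occurs, so it carries no weight
    idf = 1 + math.log(len(corpus) / containing)
    return tf * idf


# Hypothetical corpus of tokenized documents.
corpus = [
    text.split()
    for text in ["great food great service", "slow service", "great coffee"]
]

# Score every distinct term of the first document.
scores = {term: tfidf(term, corpus[0], corpus) for term in set(corpus[0])}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
```

Note that the score is specific to a single document: the same term can have a different TFIDF value in each document because its count differs, while its IDF stays fixed for the corpus.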
Beyond Bag of Words
Bag of words is simple and requires no sophisticated parsing ability or other linguistic analysis, but some applications need more
N-Gram Sequence
Include sequences of adjacent words as terms
Adjacent pairs are commonly called bi-grams
Useful when particular phrases are significant but their component words may not be
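One way to sketch the n-gram idea is to add each run of adjacent words to the list of terms alongside the individual words. The helper name `ngrams`, the underscore joiner, and the example sentence are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox jumps".split()

# Bag of words extended with bi-grams: single words plus adjacent pairs.
terms = tokens + ngrams(tokens, 2)
print(terms)
```

The trade-off is that the number of distinct terms grows quickly, since many more pairs than single words can occur in a corpus.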
Named Entity Extraction
Knowledge intensive
To work well, named entity extractors have to be trained on a large corpus
Topic Models
Model the set of topics in a corpus separately