Ch. 10 Representing and Mining Text
Text Data
Why it's important
Converting to meaningful form
user-generated content
be able to understand it
Text is difficult
Unstructured data
linguistic structure
not intended for computers
understand context
preprocessing before use as input
Representation
Use basic steps to transform text
Information Retrieval
Document
Tokens or Terms
Collection of documents
Corpus
Taking free form text
create feature-vector form that is familiar
Bag of words
Treat every word in the document as an individual term
Creates a set of words
Term Frequency
word count
the more a word is used, the higher its importance
Normalize data
words are all same case
stemmed
stopwords removed
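
The normalization steps and word counts above can be sketched as follows. This is a minimal illustration, not a reference implementation: the stopword list is a tiny hand-picked assumption, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}

def crude_stem(word):
    """Strip a few common suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Lowercase, tokenize, drop stopwords, stem, then count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(terms)

tf = term_frequencies("The dogs barked and the dog is barking")
print(tf)
```

After normalization, "dogs" and "dog" count as the same term, as do "barked" and "barking".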
Measure sparseness
term shouldn't be too rare
Impose a limit
Term shouldn't be too common
Impose a limit
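
One way to impose both limits is to count how many documents contain each term and keep only terms inside a chosen range. The sketch below assumes documents are already tokenized; the function name and the thresholds `min_docs` and `max_fraction` are hypothetical choices for illustration.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_docs=2, max_fraction=0.8):
    """Keep terms that are neither too rare nor too common.

    docs: list of token lists.
    min_docs: a term must appear in at least this many documents.
    max_fraction: a term may appear in at most this fraction of documents.
    """
    doc_freq = Counter()
    for tokens in docs:
        # Use a set so each document counts a term at most once.
        doc_freq.update(set(tokens))
    n = len(docs)
    return {
        term
        for term, count in doc_freq.items()
        if count >= min_docs and count / n <= max_fraction
    }

docs = [
    ["jazz", "music", "the"],
    ["jazz", "piano", "the"],
    ["rock", "music", "the"],
    ["the", "guitar"],
]
kept = filter_by_document_frequency(docs)
print(sorted(kept))
```

Here "the" is dropped for appearing in every document, and "piano", "rock", and "guitar" are dropped for appearing in only one.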
Inverse document frequency
measures sparseness
uses entire corpus
Combine Term Freq. and Inverse Doc. Freq.
Resulting value is specific to a single document
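
The combination can be sketched as term frequency multiplied by inverse document frequency. The IDF form used here, 1 + log(total documents / documents containing the term), is one common variant and is an assumption; other formulations exist. Function names are hypothetical.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency, computed over the entire corpus.

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for tokens in docs if term in tokens)
    return 1 + math.log(len(docs) / n_containing)

def tfidf(doc, docs):
    """TFIDF weights for one document that belongs to the corpus `docs`."""
    counts = Counter(doc)
    return {term: count * idf(term, docs) for term, count in counts.items()}

docs = [
    ["jazz", "jazz", "the"],
    ["the", "rock"],
    ["the", "piano"],
]
weights = tfidf(docs[0], docs)
print(weights)
```

A term that is frequent in the document but rare in the corpus ("jazz") gets a higher weight than one that appears everywhere ("the").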
Beyond Bag of words
N-gram sequences
sequences of adjacent words as terms
particular phrase is significant
Increase size of feature set
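
A minimal sketch of n-gram extraction, assuming the text is already tokenized. Joining adjacent words with an underscore is an arbitrary convention chosen here so that each phrase becomes a single term.

```python
def ngrams(tokens, n):
    """Return every sequence of n adjacent tokens, joined into one term."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "quick", "brown", "fox"]
bigrams = ngrams(tokens, 2)

# Adding bigrams to the single-word terms enlarges the feature set.
features = tokens + bigrams
print(features)
```

Four single words plus three bigrams give seven features, which shows how quickly the feature set grows as n-grams are added.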
Named entity extraction
recognize common named entities
Topic models
words map to topics
a clustering of words
term weights are learned
Learned through the process