Provost: Chapter 10
Representing and Mining Text
Data Preparation
important step before data mining
Text Data
Popular form of data
Often requires pre-processing steps and sometimes domain expertise
Everywhere - all types of text records
Unstructured data
Linguistic data written for human consumption, not for computers
Text Mining
Representation
Term Frequency
raw term counts as the document representation
normalization and stemming
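Stopwords: very common words are removed
Stemming: suffixes are stripped so word variants share one term
Normalization: all text is converted to lowercase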
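The term-frequency steps above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the `STOPWORDS` set is a tiny hypothetical list, and `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}

def crude_stem(word):
    """Toy suffix stripper for illustration only; not the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Lowercase, tokenize, drop stopwords, stem, then count each term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(terms)

tf = term_frequencies("The markets are surging and the market surged")
```

Here "markets" and "market" collapse into one term, as do "surging" and "surged", which is the point of stemming: fewer, denser features.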
Sparseness
Inverse document frequency (IDF)
terms should not be too rare
terms should not be too common
Combining
TFIDF
Term Frequency and Inverse Document Frequency
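As a rough sketch of how the two measures combine, the code below multiplies each term's raw count by an IDF weight. It assumes one common form of IDF, 1 + log(N / n_t), where N is the number of documents and n_t is the number of documents containing term t; other variants exist. The function name `tfidf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF form: 1 + log(N / n_t). Rarer terms get larger weights.
    idf = {term: 1 + math.log(n_docs / count) for term, count in df.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]

# Hypothetical, already tokenized documents.
docs = [["stock", "surge", "stock"], ["stock", "plunge"], ["stock", "stable"]]
scores = tfidf(docs)
```

Because "stock" occurs in every document, its IDF is the minimum value of 1, while "surge", "plunge", and "stable" each occur in one document and receive a higher weight.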
Bag of Words
n-grams
cons: increase size of feature set
pros: easy to generate
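A short sketch of n-gram generation shows both sides of the trade-off: the features are trivial to produce, but the feature set grows with every value of n. The helper names `ngrams` and `bag_of_ngrams` and the sample tokens are hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(tokens, max_n=2):
    """Collect all n-grams for n = 1 .. max_n into one feature list."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

# Hypothetical tokens, already normalized and stemmed.
tokens = ["exceed", "analyst", "expectation"]
features = bag_of_ngrams(tokens, max_n=2)
```

With three tokens, adding bigrams raises the feature count from three to five; on a real vocabulary the growth is far larger.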
named entity extraction
more intricate than n-grams: relies on trained or hand-coded extractors
Key Terms
Surge
Plunge
Stable