Ch10. Representing and Mining Text
Intro
Fundamental concepts: The importance of constructing mining-friendly data representations.
Why text is important
It's everywhere!
Why text is difficult
"unstructured", grammar, spelling, synonyms
Representation
Document: a piece of text; tokens or terms: words; corpus: a collection of documents
Bag of words
Term frequency
Count the frequency of each word in a document; sort, normalize, stem, and remove stopwords
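A minimal sketch of the term-frequency steps, using only the standard library. The stopword list and the `crude_stem` helper are hypothetical stand-ins for illustration; a real project would likely use a proper stopword list and stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}

def crude_stem(word):
    """Very rough suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Return (term, normalized frequency) pairs, most frequent first."""
    tokens = re.findall(r"[a-z]+", text.lower())          # lowercase, split into words
    terms = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    counts = Counter(terms)
    total = sum(counts.values())
    # Normalize by document length so long and short documents are comparable.
    return [(term, count / total) for term, count in counts.most_common()]

tf = term_frequencies("The cats chased the cat and the dog barked")
```

Here "cats" and "cat" collapse to one term after stemming, so that term ends up with the highest frequency.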
Measuring Sparseness: Inverse document frequency
IDF(t) = 1 + log(total number of documents / number of documents containing t)
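The IDF formula can be sketched as follows. The notes do not state the base of the logarithm, so the natural log is an assumption here, and the toy corpus is made up for illustration.

```python
import math

def idf(term, corpus):
    """IDF(t) = 1 + log(N / n_t), where corpus is a list of token lists."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    # Natural log is an assumption; the base only rescales the result.
    return 1 + math.log(n_docs / n_containing)

# Hypothetical toy corpus: each document is already tokenized.
corpus = [
    ["jazz", "piano"],
    ["jazz", "trumpet"],
    ["jazz", "saxophone"],
    ["rock", "guitar"],
]

common_idf = idf("jazz", corpus)   # appears in 3 of 4 documents
rare_idf = idf("piano", corpus)    # appears in 1 of 4 documents
```

The rarer a term is across the corpus, the larger its IDF; a term that appears in every document gets the minimum value of 1.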
Combining them: TFIDF
TFIDF(t,d)=TF(t,d)*IDF(t)
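A small sketch of the combined score, assuming TF is the raw count of the term in the document and the log is natural (neither is specified in the notes). The documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf(term, doc, corpus):
    """TFIDF(t, d) = TF(t, d) * IDF(t), with TF as a raw count (an assumption)."""
    tf = Counter(doc)[term]
    n_containing = sum(1 for d in corpus if term in d)
    if n_containing == 0:
        return 0.0  # term never occurs, so it carries no weight
    idf = 1 + math.log(len(corpus) / n_containing)
    return tf * idf

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["jazz", "jazz", "piano"],
    ["jazz", "trumpet"],
    ["jazz", "drums"],
]

jazz_score = tfidf("jazz", docs[0], docs)    # frequent here, but in every document
piano_score = tfidf("piano", docs[0], docs)  # less frequent here, but rare overall
```

In this toy case "piano" outscores "jazz" in the first document even though it occurs less often there, because it is rare in the rest of the corpus.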
Example: Jazz Musicians
The relationship of IDF to Entropy
p: the probability that term t occurs in a document set; IDF is related to entropy through p
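One way to see the connection, as a sketch: if p is the fraction of documents containing t, then IDF without its constant term is log(1/p), and the entropy of the two-outcome split (t present, t absent) is the expected value of that quantity. The function names below are hypothetical.

```python
import math

def idf_no_constant(p):
    """log(1 / p): IDF with the leading '1 +' dropped; requires 0 < p <= 1."""
    return math.log(1 / p)

def binary_entropy(p):
    """Entropy of 't present' vs 't absent', written as an expected IDF-like value.

    Requires 0 < p < 1 so that both logarithms are defined.
    """
    return p * idf_no_constant(p) + (1 - p) * idf_no_constant(1 - p)

h_half = binary_entropy(0.5)    # term appears in half the documents
h_rare = binary_entropy(0.01)   # term appears in 1% of the documents
```

Entropy peaks when the term splits the corpus evenly and shrinks as the term becomes very rare or very common.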
Results
modifiers (not, despite, except)
Beyond bag of words
N-gram sequences
if phrases are more important than just words
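A minimal sketch of n-gram extraction, where adjacent word pairs are added to the bag of words as extra features. The helper name `ngrams` and the sample sentence are illustrative only.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)

# Single words plus adjacent pairs, all treated as terms in the bag.
features = tokens + bigrams
```

The feature set grows quickly with n, which is the usual cost of capturing phrases this way.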
Named entity extraction
Words that are associated together, e.g. New York, and annotations
Topic models
Creating topics based on words appearing together
Example: mining news stories to predict price movement
predicting the stock market based on news articles
The task
1. same day 2. change vs. no change 3. large changes 4. causal radius --> surge, stable, plunge
The data
NASDAQ, Yahoo Finance, Google Finance
Data preprocessing
percent change based on opening and closing prices.
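A sketch of how the percent change could be computed and mapped to the surge, stable, and plunge labels from the task. The 5% threshold and the function names are hypothetical; the notes do not say which cutoffs were used.

```python
def percent_change(open_price, close_price):
    """Percent change from the opening price to the closing price."""
    return (close_price - open_price) / open_price * 100.0

def label_move(pct, threshold=5.0):
    """Map a percent change to a class label; the threshold is an assumed value."""
    if pct >= threshold:
        return "surge"
    if pct <= -threshold:
        return "plunge"
    return "stable"

# Made-up prices for illustration.
up = label_move(percent_change(100.0, 106.0))
down = label_move(percent_change(100.0, 94.0))
flat = label_move(percent_change(100.0, 101.0))
```

Each news story would then be paired with the label of the stock's movement on the same day.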
Summary
Text often requires more preprocessing than numbers, but it is widely used and therefore very important