Chapter 10 Representing and Mining Text
Why is text important?
Vast amount of text on the Internet
Twitter feeds
Emails
Personal web pages
Facebook updates
Business needs
Customer feedback
Legacy applications
Representation
Tokens or terms
Corpus: a collection of documents
Document: one piece of text
First approach
Bag of words
A collection of individual words
Term Frequency
Normalized: every word is lowercase
Stemmed: suffixes removed
Stopwords: remove very common words such as the, and, of
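The term-frequency steps above (lowercasing, stemming, stopword removal) can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stopword list and the `naive_stem` helper are assumptions made up for this example, and a real system would use an established stemmer and a much longer stopword list.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "and", "of", "a", "in", "to", "is"}

def naive_stem(word):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(document):
    """Return a bag of words: term -> count within one document."""
    tokens = re.findall(r"[a-z]+", document.lower())  # normalize to lowercase
    terms = [naive_stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(terms)

tf = term_frequencies("The musicians played and the crowd loved the playing")
```

Here "played" and "playing" collapse to the same term, so `tf["play"]` counts both occurrences, while "the" and "and" are dropped entirely.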
Measuring sparseness
Not too rare
Not overly common
Imposing upper and lower limits on term frequency
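One way to impose such limits is to count how many documents each term appears in and keep only terms inside a chosen band. The sketch below assumes the limits are applied to document counts; the function name and the `min_docs` and `max_fraction` values are hypothetical and would be tuned per corpus.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_docs=2, max_fraction=0.8):
    """docs is a list of token lists.

    Keep terms that appear in at least min_docs documents (not too rare)
    and in at most max_fraction of all documents (not overly common).
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    return {
        term
        for term, count in doc_freq.items()
        if count >= min_docs and count / n_docs <= max_fraction
    }

docs = [
    ["jazz", "sax", "music"],
    ["jazz", "trumpet", "music"],
    ["jazz", "piano", "music", "sax"],
    ["jazz", "blues"],
]
kept = filter_by_document_frequency(docs, min_docs=2, max_fraction=0.8)
```

In this toy corpus "jazz" appears in every document and is dropped as overly common, while "trumpet", "piano", and "blues" appear only once and are dropped as too rare.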
TFIDF: Term Frequency & Inverse Document Frequency
Example: Jazz Musicians
The relationship of IDF and Entropy
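A small sketch of the TFIDF weighting may help. It assumes one common form of the inverse document frequency, IDF(t) = 1 + log(N / n_t), where N is the number of documents in the corpus and n_t is the number of documents containing term t; other variants exist. The function names and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def idf(term, docs):
    """Inverse document frequency, assuming IDF(t) = 1 + log(N / n_t).

    The term must occur in at least one document of docs.
    """
    containing = sum(1 for tokens in docs if term in tokens)
    return 1 + math.log(len(docs) / containing)

def tfidf(tokens, docs):
    """TFIDF weight for each term of one document, relative to the corpus docs."""
    tf = Counter(tokens)
    return {term: count * idf(term, docs) for term, count in tf.items()}

docs = [
    ["jazz", "sax", "jazz"],
    ["jazz", "trumpet"],
    ["jazz", "piano"],
    ["jazz", "blues"],
]
weights = tfidf(docs[0], docs)
```

Although "jazz" occurs twice in the first document, it appears in every document, so its IDF is the minimum value of 1; "sax" occurs once but is rare in the corpus and ends up with the larger weight.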
Why is text difficult?
Synonyms and homographs
Context
Unstructured data
N-gram sequences
Sequences of adjacent words as terms
Disadvantages
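Generating n-grams from a token list is straightforward; the sketch below uses a hypothetical helper named `ngrams` and joins adjacent words with a space so that each sequence can be treated as a single term.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, each joined into one term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "quick", "brown", "fox"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Adding these sequences to the single-word terms makes the set of distinct terms grow quickly, which is one commonly noted cost of the approach.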
Named Entity Extraction
Knowledge intensive
Topic Models
Matrix factorization method
Example: Mining news stories to predict stock price movement
The Task breakdown
Period: same day
Change or no change
Surge, Stable and Plunge
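The labelling step in this breakdown could be sketched as follows. The 5% threshold, the use of same-day open and close prices, and the function names are all assumptions chosen for illustration; the outline itself does not specify them.

```python
def label_move(open_price, close_price, threshold=0.05):
    """Label a same-day price move as 'surge', 'plunge', or 'stable'.

    threshold is a hypothetical cutoff on the relative price change.
    """
    change = (close_price - open_price) / open_price
    if change >= threshold:
        return "surge"
    if change <= -threshold:
        return "plunge"
    return "stable"

def has_changed(open_price, close_price, threshold=0.05):
    """Two-class version: change versus no change."""
    return label_move(open_price, close_price, threshold) != "stable"
```

For example, `label_move(100, 106)` yields "surge" and `label_move(100, 101)` yields "stable" under this threshold.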
The Data
The stream of news stories
A corresponding stream of daily stock prices
The data preprocessing