Data Science for Business
By: Fawcett & Provost
"CH 10: Representing and Mining Text"
Representation
Transforming a body of text into a set of data
Terminology:
Document: one piece of text
Tokens/terms: the various components of a document
Corpus: a collection of documents
Bag of Words Approach
Treats every document as a collection of individual words
Ignores grammar, word order, sentence structure, & punctuation
Term Frequency:
Produces a table of word counts (frequencies)
Considerations
Case has been normalized
Words have been stemmed
Stopwords have been removed
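The preprocessing steps above (case normalization, stopword removal, stemming) followed by term counting can be sketched in Python. The stopword list and the crude suffix-stripping `stem` below are toy stand-ins for real tools (e.g. a Porter stemmer), not the book's code:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer
STOPWORDS = {"the", "a", "of", "and", "to", "in", "it"}

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(document):
    # Normalize case, keep alphabetic tokens, drop stopwords, then stem
    tokens = re.findall(r"[a-z]+", document.lower())
    terms = [stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(terms)

tf = term_frequencies("The musicians were playing jazz and played it well.")
```

Here both "playing" and "played" collapse to the term "play", so its frequency is 2.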
Measuring Sparseness
Term should not be too rare
Term should not be too common
Inverse Document Frequency (IDF)
Equation: IDF(t) = 1 + log(total # of documents / # of documents containing t)
Relationship to Entropy (via IDF(t) and IDF(not_t))
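The IDF equation translates directly into code. The four-document corpus below is illustrative only, and the natural log is used here since the chapter's formula does not pin down a base:

```python
import math

def idf(term, corpus):
    # corpus: a list of documents, each represented as a set of terms
    containing = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / containing)

corpus = [{"jazz", "guitar"}, {"jazz", "sax"}, {"opera", "voice"}, {"jazz"}]
# "jazz" occurs in 3 of 4 documents (too common -> low IDF);
# "opera" occurs in only 1 of 4 (rare -> high IDF)
```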
TFIDF
Product of Term Frequency (TF) and IDF
Value is specific to a single document
Each document becomes a feature vector
Common representation, but may not be optimal
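Turning each document into a TFIDF feature vector can be sketched as follows; the three short documents and whitespace tokenization are assumptions for illustration, not the book's setup:

```python
import math
from collections import Counter

docs = [
    "charlie parker played bebop sax",
    "miles davis played cool jazz trumpet",
    "john coltrane played modal jazz sax",
]
tokenized = [doc.split() for doc in docs]
# Fixed vocabulary: one feature position per distinct term
vocab = sorted({t for doc in tokenized for t in doc})

def idf(term):
    containing = sum(1 for doc in tokenized if term in doc)
    return 1 + math.log(len(tokenized) / containing)

def tfidf_vector(doc):
    # TFIDF(t, d) = TF(t, d) * IDF(t); terms absent from the document get 0
    tf = Counter(doc)
    return [tf[t] * idf(t) for t in vocab]

vectors = [tfidf_vector(doc) for doc in tokenized]
```

"played" appears in every document, so its IDF is 1 and it is down-weighted relative to a rare term like "bebop".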
Example: Jazz Musicians
Apply basic stemming
Cosine Similarity Function
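Cosine similarity compares two feature vectors by the angle between them: the dot product divided by the product of their norms. A minimal sketch:

```python
import math

def cosine_similarity(x, y):
    # Dot product of the two vectors over the product of their norms
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

cosine_similarity([1, 2, 0], [1, 2, 0])  # identical direction -> 1.0
cosine_similarity([1, 0], [0, 1])        # orthogonal -> 0.0
```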
Fundamental Concepts
Importance of constructing mining-friendly data representations
Representation of text for data mining
Text
Why is it important?
Text is everywhere (e.g. medical records, web pages, Twitter feeds)
Business: essential for understanding customer feedback
Why is it difficult?
"Unstructured data"
Linguistic structure: made for human consumption
Context is crucial
Relatively dirty
Subject to human behavior and error
Beyond Bag of Words
N-gram Sequences
Include sequences of adjacent words as terms
Useful when phrases are significant, but their component words are not
Easy to generate
Requires no linguistic knowledge
No complex parsing algorithm
Main disadvantage: greatly increases size of the feature set
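The "easy to generate" point can be seen in a few lines: slide a window of length n over the token list. Joining components with an underscore is one common convention, assumed here rather than prescribed by the book:

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list, joining with "_"
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "exceed analyst expectations".split()
ngrams(tokens, 2)  # ['exceed_analyst', 'analyst_expectations']
```

With a vocabulary of size V there are up to V^2 possible bigrams, which is why the feature set grows so quickly.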
Named Entity Extraction
Knowledge intensive
Requires a large corpus
Knowledge has to be learned, or coded by hand
Topic Models
Modeling documents w/ a topic layer
Terms and term weights are learned by the modeling process
A type of latent information model
Example: Mining News Stories to Predict Stock Price Movement
The Task
Predict the effect on stock price for the same day
Seek out direction of stock price movement (i.e. up, down, no change)
Predict relatively large changes
Make assumptions to narrow the "causal radius"
The Data
Two separate time series
Stream of news stories
Stream of daily stock prices
Crucial to eliminate some of the noise
Data Preprocessing
Measure with appropriate variables to find daily percent change
Align stories with the correct day & trading window
Reject stories mentioning more than two stocks
Each story is tagged with a label (change or no change)
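A minimal sketch of the percent-change and labeling steps above, assuming a simple threshold on the daily move; the 5% cutoff and the label names are illustrative choices, not the study's actual parameters:

```python
def percent_change(open_price, close_price):
    # Daily percent change relative to the opening price
    return 100.0 * (close_price - open_price) / open_price

def label_story(change, threshold=5.0):
    # threshold (in percent) is an illustrative cutoff for a
    # "relatively large" move; small moves count as "no change"
    if abs(change) < threshold:
        return "no change"
    return "change-up" if change > 0 else "change-down"

label_story(percent_change(40.0, 43.0))  # a 7.5% rise -> "change-up"
```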
Results
Bag of words representation is primitive for this task
Named entity recognition could be improved
More instantaneous price changes due to quick market responses