Representing And Mining Text
Many legacy applications still produce or record text. Medical records, consumer complaint logs, product inquiries, and repair records are still mostly intended as communication between people, not computers, so they're still "coded" as text. The web also contains a vast amount of text in the form of personal web pages, Twitter feeds, email, and Facebook status updates.
Why Text Is Difficult
Text is often referred to as "unstructured" data: it does not have the sort of structure that we normally expect for data, namely tables of records with fields having fixed meanings (essentially, collections of feature vectors).
Representation
A document is one piece of text, no matter how large or small. A document is composed of individual tokens or terms. A collection of documents is called a corpus.
Bag of Words
The approach is to treat every document as just a collection of individual words, ignoring grammar, word order, and sentence structure. It treats every word in a document as a potentially important keyword of the document. Every word is a possible feature, and each document is represented by a one (if the token is present in the document) or a zero (if the token is not present in the document).
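The binary bag-of-words representation above can be sketched in a few lines. This is a minimal illustration, not the book's implementation: the corpus, the naive lowercase-split tokenizer, and the function names are assumptions for the example.

```python
# Bag-of-words sketch: each document becomes a binary feature vector
# over the corpus vocabulary (1 = token present, 0 = token absent).
# Tokenization here is a naive lowercase whitespace split.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

def tokenize(document):
    return document.lower().split()

# Vocabulary: every distinct token seen anywhere in the corpus.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

def bag_of_words(document):
    tokens = set(tokenize(document))
    return [1 if term in tokens else 0 for term in vocabulary]

vectors = [bag_of_words(doc) for doc in corpus]
```

Each document is now a fixed-length feature vector, so it can be fed to any standard learning algorithm.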
Term Frequency
Instead of just a zero or one, use the word count (frequency) in the document. This allows us to differentiate between how many times a word is used; in some applications, the importance of a term in a document should increase with the number of times that term occurs. This is the term frequency representation.
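A term-frequency sketch of the same idea, replacing the binary indicator with a raw count per document (the tokenizer and example sentence are assumptions for illustration):

```python
# Term-frequency sketch: count how many times each token occurs in a
# document. Counter from the standard library does the tallying.
from collections import Counter

def term_frequencies(document):
    return Counter(document.lower().split())

tf = term_frequencies("the cat sat on the mat")
# tf["the"] is 2, tf["cat"] is 1; absent tokens count as 0.
```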
Measuring Sparseness: Inverse Document Frequency
A term should not be too rare. Say the unusual word "prehensile" occurs in only one document in your corpus. Is it an important term? That depends on the application. For clustering, there is no point keeping a term that occurs only once; it will never be the basis of a meaningful cluster. A term should also not be too common. A term occurring in every document isn't useful for classification. Overly common terms are typically eliminated by imposing an arbitrary upper limit on the number (or fraction) of documents in which a word may occur.
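The sparseness of a term across the corpus is captured by its inverse document frequency. As a sketch, one common formulation is IDF(t) = 1 + log(N / df(t)), where N is the number of documents and df(t) is how many contain term t; the corpus and formulation here are illustrative assumptions:

```python
# IDF sketch using the common 1 + log(N / df) formulation.
# Rare terms get a high IDF; ubiquitous terms get an IDF near 1.
import math

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a prehensile tail",
]

def idf(term, corpus):
    n_docs = len(corpus)
    # Number of documents containing the term (assumes df > 0).
    df = sum(1 for doc in corpus if term in doc.lower().split())
    return 1.0 + math.log(n_docs / df)

# "the" appears in 2 of 3 documents -> low IDF;
# "prehensile" appears in only 1 -> higher IDF.
```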
Combining Them: TFIDF
TFIDF combines Term Frequency (TF) and Inverse Document Frequency (IDF). Term counts within the documents form the TF values for each term, and the document counts across the corpus form the IDF values. The TFIDF score of a term t in a document d is the product TF(t, d) × IDF(t).
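Combining the TF and IDF ideas gives a per-term, per-document score. A minimal sketch, assuming a naive lowercase-split tokenizer and the common 1 + log(N / df) variant of IDF:

```python
# TFIDF sketch: score of term t in document d is TF(t, d) * IDF(t).
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a prehensile tail",
]

def tfidf(term, document, corpus):
    tf = Counter(document.lower().split())[term]          # count in d
    df = sum(1 for doc in corpus if term in doc.lower().split())
    idf = 1.0 + math.log(len(corpus) / df)                # rarity weight
    return tf * idf

score = tfidf("cat", corpus[0], corpus)
```

A term scores highly when it is frequent within a document but rare across the corpus, which is exactly the balance the two preceding sections motivate.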
Example: Mining News Stories to Predict Stock Price Movement
The Task
We want to predict stock price changes based on financial news stories.
The Data
The data to be mined are historical data from 1999 for stocks listed on the New York Stock Exchange and NASDAQ.
Data Preprocessing
Each stock has an opening and a closing price for the day, measured at 9:30 am EST and 4:00 pm EST, respectively. From these values we can easily compute a percentage change: (close − open) / open × 100.
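That preprocessing step can be sketched as follows; the function name and example prices are assumptions for illustration:

```python
# Daily percentage price change from opening and closing prices.
def percent_change(open_price, close_price):
    return (close_price - open_price) / open_price * 100.0

# e.g. a stock opening at 40.00 and closing at 41.00 moved +2.5%.
change = percent_change(40.0, 41.0)
```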
Results