Ch 10: Representing & Mining Text
First try to engineer the data to match existing tools
Requires dedicated pre-processing steps
Why Text is Important
It is everywhere
Used to listen to consumer sentiment
Why Text is Difficult
Unstructured data
Different interpretations
Context is important
Representation
Borrowed from the field of Information Retrieval
A document is composed of individual tokens or terms
Corpus: Collection of Documents
Bag of Words
Treat every document as a collection of individual words
Reduces the document to the words contained inside
Term Frequency
Higher term frequency means the word is more important in that document
Organized in table
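The bag-of-words counting above can be sketched with the standard library (the sample documents are illustrative, not from the chapter):

```python
from collections import Counter

def term_frequencies(document):
    """Bag of words: lowercase the text, split on whitespace, count each term."""
    return Counter(document.lower().split())

# One row per document, one column per term -- the term-frequency table.
docs = ["The cat sat on the mat", "The dog sat"]
table = [term_frequencies(d) for d in docs]
```

Word order is discarded entirely; each document is reduced to the words it contains and how often they appear.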
Measuring Sparseness: Inverse Document Frequency
Specify a lower limit on term frequency
A term should not be too common
IDF: the boost a term gets for being rare
TFIDF
Term Frequency and Inverse Document Frequency
Each document becomes a feature vector
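A minimal TFIDF sketch, assuming the common log(N / document frequency) form of IDF (other variants exist):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each document into a feature vector of TF * IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # IDF boosts rare terms; a term that appears in every document gets weight 0.
    idf = {t: math.log(n / sum(t in toks for toks in tokenized)) for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors
```

Each document becomes one fixed-length vector over the shared vocabulary, ready for standard mining tools.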
Beyond Bag of Words
N-Gram Sequences
Treat sequences of adjacent words as terms
Disadvantage: greatly increases size of feature set
no linguistic knowledge or complex parsing
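A minimal n-gram sketch; joining adjacent tokens with an underscore is just one common convention:

```python
def ngram_terms(tokens, n=2):
    """Treat each sequence of n adjacent tokens as a single term."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

ngram_terms("new york stock exchange".split())
# -> ['new_york', 'york_stock', 'stock_exchange']
```

Note how a four-token document already yields three bigram terms on top of its four unigrams, which is why the feature set grows so quickly.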
Named Entity Extraction
Text processing toolkits
Have to be trained on a large corpus to work well
Topic Models
Queries can use terms that do not exactly match the document's terms
Matrix factorization methods
Latent Semantic Indexing
Latent Dirichlet Allocation
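A sketch of the matrix-factorization idea behind Latent Semantic Indexing, using a toy term-document matrix and a truncated SVD (the counts and the choice of k=2 topics are illustrative assumptions):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
X = np.array([
    [2, 0, 1],   # "stock"
    [1, 0, 1],   # "trade"
    [0, 3, 0],   # "goal"
    [0, 1, 0],   # "match"
], dtype=float)

# LSI: a truncated SVD projects documents into a small latent "topic" space,
# so a query can match documents that share no exact terms with it.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # number of latent topics to keep
doc_topics = (np.diag(s[:k]) @ Vt[:k, :]).T  # one k-dimensional vector per document
```

Similarity is then computed between these low-dimensional topic vectors rather than between raw term vectors. (LDA takes a probabilistic rather than a linear-algebraic route to the same goal.)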
Ex: Mining News Stories to Predict Stock Price Movement
Task
Want to make trades based on news
Recommend the most interesting/influential news stories
Difficult to predict the effect of news in advance
Difficult to predict the exact Stock Price
Difficult to predict small changes
Stick to a simpler prediction target instead
Define percentage zones
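The percentage zones can be sketched as a simple discretization of the day's relative price change; the 5% cutoff and the zone names here are illustrative assumptions, not the chapter's exact values:

```python
def label_change(old_price, new_price, threshold=0.05):
    """Discretize a price movement into coarse zones (assumed 5% cutoff)."""
    change = (new_price - old_price) / old_price
    if change > threshold:
        return "surge"
    if change < -threshold:
        return "plunge"
    return "stable"
```

Predicting which zone a stock lands in is a far easier classification task than predicting its exact price.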
Data
Mine data from Yahoo
Chart annotations with news timeline
Data Preprocessing
Don't pick the opening price, because stocks are more sensitive at the open
TFIDF
Results
ROC curve to evaluate classifier performance
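A minimal sketch of how ROC points are computed, assuming the classifier emits a ranking score per story and a binary label for whether the price moved:

```python
def roc_points(scores, labels):
    """Sweep the decision threshold from high to low, collecting (FPR, TPR) points."""
    pairs = sorted(zip(scores, labels), reverse=True)  # most confident first
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in pairs:
        if label:
            tp += 1   # a true positive moves the curve up
        else:
            fp += 1   # a false positive moves it right
        points.append((fp / neg, tp / pos))
    return points
```

A curve hugging the upper-left corner means the model ranks price-moving stories above the rest across all thresholds, which is more informative than a single accuracy number.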