Ch. 10 Representing and Mining Text
Why is text important?
Understanding customer feedback
Why is text difficult?
Referred to as unstructured data
Implies text does not have the sort of structure that we normally expect from data
Relatively dirty
Synonyms and homographs
Representation
Transform body of text into data
Types of approach in Text Mining
Bag of words
Inexpensive
Treats every word as potential key word
Treats every document as just a collection of individual words
Ignores grammar, word order, sentence structure, and punctuation
Term frequency
Use the word count
Importance of term increases with number of times used
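A minimal sketch of the bag-of-words and term-frequency ideas above, assuming a simple lowercase regex tokenizer. The function name `bag_of_words` and the sample sentence are illustrative, not from the chapter.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, drop punctuation, and count each word.

    Word order and sentence structure are discarded; only counts remain.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

# Invented sample document (assumption, for illustration only).
doc = "The service was slow. The food was good, the staff was friendly."
tf = bag_of_words(doc)

# Raw count used as term frequency.
print(tf["the"], tf["was"], tf["slow"])
```

Because only counts are kept, two documents with the same words in a different order get the same representation.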
Measuring Sparseness: Inverse Document Frequency
Two opposing considerations against term frequency
Term should not be too rare
Term should not be too common
Impose upper and lower limits for two opposing considerations
IDF(t) = 1 + log(total # of documents / # of documents containing t)
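The IDF formula can be sketched as follows, assuming each document is already reduced to a set of terms and that the logarithm is the natural log (the notes do not state a base). The name `idf` and the toy documents are illustrative.

```python
import math

def idf(term, documents):
    """IDF(t) = 1 + log(N / number of documents containing t)."""
    n_containing = sum(1 for doc in documents if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} appears in no document")
    return 1 + math.log(len(documents) / n_containing)

# Invented toy corpus: each document is a set of terms.
docs = [
    {"jazz", "music"},
    {"jazz", "trumpet"},
    {"stock", "price"},
    {"jazz", "stock"},
]

print(idf("trumpet", docs))  # rare term, higher IDF
print(idf("jazz", docs))     # common term, lower IDF
```

A term that appears in every document gets the minimum value of 1, since log(1) = 0.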
Relationship to entropy
Binary term: p2 = 1-p1
entropy = - p1 log(p1) - p2 log(p2)
Express entropy as expected value of IDF(t) and IDF(not t), ignoring the constant 1
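A small numerical check of the entropy relationship, assuming p1 is the fraction of documents containing the term and that the constant 1 in IDF is dropped, so IDF reduces to log(1/p). The function names are illustrative.

```python
import math

def entropy(p1):
    """Entropy of a binary term with p2 = 1 - p1 (requires 0 < p1 < 1)."""
    p2 = 1 - p1
    return -p1 * math.log(p1) - p2 * math.log(p2)

def expected_idf(p1):
    """Expected IDF over 'term present' and 'term absent', constant 1 dropped."""
    p2 = 1 - p1
    idf_t = math.log(1 / p1)       # IDF of the term, without the +1
    idf_not_t = math.log(1 / p2)   # IDF of the term's absence, without the +1
    return p1 * idf_t + p2 * idf_not_t

for p in (0.1, 0.5, 0.9):
    print(p, entropy(p), expected_idf(p))
```

The two columns agree because -p log(p) equals p log(1/p), term by term.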
Combining term frequency and inverse document frequency
TFIDF(t, d) = TF(t, d) * IDF(t)
Specific to single document
Each document becomes feature vector
Not necessarily optimal
Jazz musician example
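A minimal sketch of turning each document into a TFIDF feature vector, assuming raw counts for TF and the IDF definition given above with a natural log. The toy documents are invented and are not the jazz musician example from the chapter.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {term: TFIDF} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # raw count as term frequency
        vectors.append(
            {t: count * (1 + math.log(n_docs / df[t])) for t, count in tf.items()}
        )
    return vectors

# Invented toy corpus (assumption, for illustration only).
docs = [
    "jazz trumpet jazz band".split(),
    "jazz piano".split(),
    "stock price news".split(),
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

The same term can score differently in different documents, because TF is specific to a single document while IDF is shared across the corpus.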
N-gram sequences
Step up in terms of complexity
Pairs of adjacent words
Bi-grams
Useful when particular phrases are significant but their component words may not be
Advantage is they are easy to generate
Disadvantage is that they increase the size of feature set
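A short sketch of generating n-grams from a token list, which shows both how easy they are to produce and how they enlarge the feature set. The helper name `ngrams` and the underscore join are illustrative choices.

```python
def ngrams(tokens, n=2):
    """Return all runs of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()
bigrams = ngrams(tokens, 2)

# Single words plus bi-grams: the feature set grows from 5 to 9 entries.
features = tokens + bigrams
print(bigrams)
print(len(tokens), len(features))
```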
Named entity extraction
Knowledge intensive
Trained on large corpus
Knowledge needs to be trained or coded by hand
Topic Models
Model set of topics in corpus separately
Final classifier defined in terms of intermediate topics rather than words
Advantage is query can use terms that do not exactly match specific words of a document
General methods for creating topic models include matrix factorization
Mining news stories to predict stock price movement
The task
Want to predict stock price changes based on financial news
The Data
Data used comprise two separate time series
Stream of news stories
Corresponding stream of daily stock prices
Data Processing
Reject stories mentioning two different stocks
Gets rid of stories that are summaries
Bag of words applied to reduce each story to a TFIDF representation
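The filtering step can be sketched as below, assuming each story is a dict carrying a list of the stock symbols it mentions. The field names, symbols, and stories are hypothetical and do not come from the original data set.

```python
def reject_multi_stock_stories(stories):
    """Drop stories that mention two or more different stocks."""
    return [s for s in stories if len(set(s["symbols"])) < 2]

# Hypothetical stories with made-up symbols.
stories = [
    {"id": 1, "symbols": ["AAA"]},
    {"id": 2, "symbols": ["AAA", "BBB"]},  # summary-like story, rejected
    {"id": 3, "symbols": ["BBB", "BBB"]},  # same stock mentioned twice, kept
]
kept = reject_multi_stock_stories(stories)
print([s["id"] for s in kept])
```

Each remaining story could then be passed through a bag-of-words and TFIDF step before modeling.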
The results
Logistic regression & Naive Bayes perform similarly
No obvious region of superiority within ROC curves