Provost Chapter 10: Representing and Mining Text
Intro
Fundamental Concepts
The importance of constructing mining-friendly data representations
Representation of text for data mining
Representation
A document is one piece of text, no matter how large or small
A document is composed of individual tokens or terms. For now, think of a token or term as just a word
A collection of documents is called a corpus
Why Text Is Important
text is written for communication between people, not computers
the thrust of Web 2.0 was Internet sites allowing users to interact with one another as a community, and to generate much of a site's content
Why Text Is Difficult
unstructured
refers to the fact that text does not have the sort of structure that we normally expect for data
Terminology and abbreviations in one domain might be meaningless in another domain
context is important
Bag of Words
the purpose of the text representation task is to turn each document into a feature vector suitable for mining
the approach is to treat every document as just a collection of individual words
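As a minimal sketch of this idea (not taken from the book), each document can be reduced to a 1/0 vector recording which vocabulary terms it contains; the vocabulary here is a made-up example:

```python
def bow_vector(document, vocabulary):
    """Represent a document as the presence (1) or absence (0) of each vocabulary term."""
    tokens = set(document.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]

vocab = ["cat", "dog", "mat", "sat"]
print(bow_vector("The cat sat on the mat", vocab))  # [1, 0, 1, 1]
```

Note that word order and grammar are discarded entirely, which is exactly what "bag of words" means.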
Term Frequency
The next step up is to use the word count (frequency) in the document instead of just a zero or one
captures how often a word is used, not just whether it appears at all
Measuring Sparseness: Inverse Document Frequency
two opposing considerations
a term should not be too rare
the opposite consideration is that a term should not be too common
Combining Them: TFIDF
A very popular representation for text is the product of Term Frequency (TF) and Inverse Document Frequency (IDF)
TFIDF(t, d) = TF(t, d) × IDF(t)
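A worked sketch of the TF × IDF product on a toy three-document corpus (the corpus is invented for illustration; the logarithm base is left unspecified in the text, so the natural log is assumed here):

```python
import math

def idf(term, corpus_tokens):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return 1 + math.log(len(corpus_tokens) / n_containing)

def tfidf(term, doc_tokens, corpus_tokens):
    """TFIDF(t, d) = TF(t, d) * IDF(t), with TF as the raw count in d."""
    return doc_tokens.count(term) * idf(term, corpus_tokens)

corpus = ["the cat sat", "the dog ran", "the cat ran"]
tokens = [d.split() for d in corpus]
# "cat" appears in 2 of 3 documents, so IDF("cat") = 1 + log(3/2)
print(round(tfidf("cat", tokens[0], tokens), 3))
```

A term like "the", present in every document, gets IDF = 1 + log(1) = 1, so common terms contribute little beyond their raw frequency.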
Each document thus becomes a feature vector, and the corpus is the set of these feature vectors
The Relationship of IDF to Entropy
IDF(t) = 1 + log( Total number of documents / Number of documents containing t )
Beyond Bag of Words
N-gram Sequences
word order is important
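A sketch of extracting n-grams, which preserve local word order that the bag-of-words representation throws away (the sample phrase is illustrative, not from the book):

```python
def ngrams(tokens, n):
    """Return all n-token subsequences of the token list, preserving order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "exceed analyst expectations".split()
print(ngrams(words, 2))  # ['exceed analyst', 'analyst expectations']
```

A document is then often represented as a bag of n-grams (e.g., all unigrams, bigrams, and trigrams together).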
Named Entity Extraction
to recognize common named entities (e.g., people, organizations, places) in documents
Topic Models
Because of the complexity of language and documents, sometimes we want an additional layer between the document and the model
Example: Mining News Stories to Predict Stock Price Movement
The Task
predict stock price fluctuations based on the text of news stories
The Data
the stream of news stories (text documents), and a corresponding stream of daily stock prices
Data Preprocessing
compute a percentage change
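The percentage change in a stock's price from one day to the next can be sketched as (prices below are made up):

```python
def pct_change(prev_close, close):
    """Percentage change from the previous day's closing price."""
    return 100.0 * (close - prev_close) / prev_close

print(pct_change(40.0, 42.0))  # 5.0  (price rose from $40 to $42)
```

These daily changes can then be bucketed (e.g., up / down / flat) to form the target labels aligned with each day's news stories.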
Results
expected value calculations and profit graphs aren’t really appropriate here