CHAPTER 10 Representing and Mining Text
Most of this is borrowed from the field of Information Retrieval.
A document is one piece of text, no matter how large or small.
Typically, all the text of a document is considered
together and is retrieved as a single item when matched or categorized.
Combining Them: TFIDF
A very popular representation for text is the product of Term Frequency (TF) and Inverse Document Frequency (IDF), commonly referred to as TFIDF.
The TFIDF value of a term t in a given document d is thus:
TFIDF(t, d) = TF(t, d) × IDF(t)
Note that the TFIDF value is specific to a single document (d) whereas IDF depends on
the entire corpus.
Term counts within the documents form the TF values for each term, and the document counts across the corpus form the IDF values.
Each document thus becomes a feature vector, and the corpus is the set of these feature vectors.
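To make the product TF × IDF concrete, here is a minimal Python sketch that turns a tiny corpus into TFIDF feature vectors. It rests on several assumptions not stated in this section: TF is the raw term count, IDF is computed as 1 + log(N / n_t), where N is the number of documents and n_t is the number of documents containing the term, and tokenization is a plain whitespace split. The function names `tokenize`, `idf`, and `tfidf_vectors` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def idf(term, tokenized_docs):
    """Assumed IDF form: 1 + log(N / n_t), where n_t counts documents containing the term."""
    containing = sum(1 for doc in tokenized_docs if term in doc)
    if containing == 0:
        return 0.0  # term never appears in the corpus; nothing to weight
    return 1 + math.log(len(tokenized_docs) / containing)


def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TFIDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    idf_values = {term: idf(term, tokenized) for term in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)  # raw term counts serve as TF here
        vectors.append([counts[term] * idf_values[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["jazz music jazz", "rock music", "jazz piano"]
vocabulary, vectors = tfidf_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vector])
```

Each row is the feature vector for one document. The IDF values are computed once from the whole corpus, while the TF counts are computed per document, which mirrors the distinction drawn above.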
Beyond Bag of Words
As presented, the bag-of-words representation treats every individual word as a term,
discarding word order entirely.
N-grams are useful when particular phrases are significant but their component words
may not be.
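As an illustration of the n-gram idea, the sketch below extends a list of word tokens with every run of adjacent words up to a chosen length. The helper names `ngrams` and `bag_of_ngrams`, the underscore used to join words, and the whitespace tokenization are assumptions made for this example only.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, max_n=2):
    """Return single words plus all n-grams up to max_n adjacent words."""
    terms = []
    for n in range(1, max_n + 1):
        terms.extend(ngrams(tokens, n))
    return terms


tokens = "the quick brown fox".split()
terms = bag_of_ngrams(tokens, max_n=2)
print(terms)
```

With `max_n=2`, the four-word input yields its four individual words plus the three adjacent pairs, so a phrase such as `quick_brown` becomes a term in its own right. The set of terms grows quickly as `max_n` increases, which is the main cost of this representation.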
The main idea of a topic layer is first to model the set of topics in a corpus separately.
Why Text Is Important
Many legacy applications still produce or record text.
In business, understanding customer feedback often requires understanding text.
The next step up is to use the word count (frequency) in the document instead of just
a zero or one.
This allows us to differentiate between how many times a word is used.
The purpose of term frequency is to represent the relevance of a term to a document.
Long documents usually will have more words—and thus more word occurrences—than shorter ones.
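The following sketch shows raw term counts alongside one way to adjust for document length. Dividing each count by the total number of words in the document is an assumption here, one common normalization among several, and the names `tokenize`, `term_counts`, and `normalized_tf` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def term_counts(text):
    """Raw term frequency: how many times each word occurs in the document."""
    return Counter(tokenize(text))


def normalized_tf(text):
    """Term counts divided by document length, so long documents are not favored."""
    counts = term_counts(text)
    total = sum(counts.values())
    if total == 0:
        return {}  # empty document: no terms to report
    return {term: count / total for term, count in counts.items()}


doc = "jazz music has jazz roots"
counts = term_counts(doc)
tf = normalized_tf(doc)
print(counts["jazz"], tf["jazz"])
```

Here "jazz" occurs twice in a five-word document, so its raw count is 2 and its length-normalized frequency is 0.4. The normalized value makes documents of different lengths comparable.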
Fundamental concepts: The importance of constructing mining-friendly data representations; Representation of text for data mining.
Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams; Stemming; Named entity extraction; Topic models
Why Text Is Difficult
Text is often referred to as “unstructured” data.
This refers to the fact that text does not have the sort of structure that we normally expect for data: tables of records with fields having fixed meanings (essentially, collections of feature vectors), as well as links between the tables.
Text of course has plenty of structure, but it is linguistic structure intended for human consumption, not for computers.
As data, text is relatively dirty. People write ungrammatically, they misspell words, they run words together, they abbreviate unpredictably, and punctuate randomly.
Bag of Words
As the name implies, the
approach is to treat every document as just a collection of individual words.
It treats every word in a document as a potentially important keyword of the document.
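A minimal sketch of this idea reduces a document to the set of words it contains, then encodes it as ones and zeros against a fixed vocabulary. The tokenizer, the function names `bag_of_words` and `binary_vector`, and the example vocabulary are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Return the distinct words in a document, ignoring order and repetition."""
    return set(tokenize(text))


def binary_vector(text, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    words = bag_of_words(text)
    return [1 if term in words else 0 for term in vocabulary]


doc = "The quick brown fox jumps over the lazy dog"
vocabulary = ["dog", "cat", "fox", "the"]
vector = binary_vector(doc, vocabulary)
print(vector)
```

The vector records only whether each vocabulary word appears. "The" occurs twice in the document but still contributes a single 1, and grammar, word order, and punctuation play no part.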