CHAPTER 10
Representing and Mining Text
Beyond Bag of Words
As presented, the bag-of-words representation treats every individual word as a term, discarding word order entirely. N-grams are useful when particular phrases are significant but their component words may not be. The main idea of a topic layer is first to model the set of topics in a corpus separately, and then to represent each document in terms of the topics it covers rather than the raw words alone.
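As an illustration, extracting n-grams amounts to sliding a window of n adjacent words over the token sequence. This is a minimal sketch; the `ngrams` helper name and the underscore-joined phrase format are illustrative choices, not from the chapter:

```python
def ngrams(tokens, n=2):
    """Return all sequences of n adjacent tokens, each joined into a single term."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bigrams turn significant phrases into terms of their own:
tokens = "exceeded our expectations".split()
bigrams = ngrams(tokens)
# ["exceeded_our", "our_expectations"]
```

Each n-gram can then be added to the document's term set alongside the individual words, so a phrase whose meaning is lost when split apart still survives as a feature.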
The next step up is to use the word count (frequency) in the document instead of just a zero or one. This allows us to differentiate between how many times a word is used; the purpose of term frequency is to represent the relevance of a term to a document. Long documents usually will have more words, and thus more word occurrences, than shorter ones, so term frequencies are often normalized by document length.

Combining Them: TFIDF

A very popular representation for text is the product of Term Frequency (TF) and Inverse Document Frequency (IDF), commonly referred to as TFIDF. The TFIDF value of a term t in a given document d is thus:

TFIDF(t, d) = TF(t, d) × IDF(t)

Term counts within the documents form the TF values for each term, and the document counts across the corpus form the IDF values. Note that the TFIDF value is specific to a single document (d) whereas IDF depends on the entire corpus. Each document thus becomes a feature vector, and the corpus is the set of these feature vectors.
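The TFIDF calculation can be sketched in a few lines of code. This is a simplified version: it assumes raw counts for TF and the common IDF variant log(N / df(t)) with no smoothing, where df(t) is the number of documents containing term t:

```python
import math
from collections import Counter

def tfidf(corpus):
    """Compute a TFIDF feature vector (term -> weight) for each tokenized document.

    TF(t, d) is the raw count of t in d; IDF(t) = log(N / df(t)), where
    df(t) is the number of documents in the corpus that contain t.
    """
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # document frequency: at most one count per document
    idf = {t: math.log(n_docs / df[t]) for t in df}
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

docs = [["jazz", "guitar", "jazz"], ["guitar", "rock"], ["jazz", "piano"]]
vecs = tfidf(docs)
# "guitar" appears in 2 of the 3 documents, so IDF("guitar") = log(3/2),
# while the rarer "rock" gets the larger IDF("rock") = log(3).
```

Note how the code mirrors the formula: the per-document counters supply TF, the corpus-wide document counts supply IDF, and each resulting dictionary is one document's feature vector.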
A document is one piece of text, no matter how large or small. Typically, all the text of a document is considered together and is retrieved as a single item when matched or categorized. Most of this is borrowed from the field of Information Retrieval.

Bag of Words

As the name implies, the approach is to treat every document as just a collection of individual words. It treats every word in a document as a potentially important keyword of the document.
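A minimal sketch of the bag-of-words representation follows. The whitespace tokenizer and binary 0/1 term vectors are simplifying assumptions; real systems also normalize punctuation and typically handle stopwords:

```python
def bag_of_words(docs):
    """Represent each document as a binary term vector over the corpus vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary: every distinct word anywhere in the corpus becomes a term.
    vocab = sorted({t for tokens in tokenized for t in tokens})
    # Each document becomes a vector: 1 if the term occurs in it, 0 otherwise.
    vectors = [[1 if t in tokens else 0 for t in vocab] for tokens in tokenized]
    return vocab, vectors

vocab, vectors = bag_of_words(["The cat sat", "the dog barked"])
# vocab   == ["barked", "cat", "dog", "sat", "the"]
# vectors == [[0, 1, 0, 1, 1], [1, 0, 1, 0, 1]]
```

Word order and grammar are discarded entirely: only presence or absence of each vocabulary term is recorded, which is exactly what makes the representation so simple and inexpensive to build.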
Why Text Is Difficult
Text is often referred to as “unstructured” data. This refers to the fact that text does not have the sort of structure that we normally expect for data: tables of records with fields having fixed meanings (essentially, collections of feature vectors), as well as links between the tables. Text of course has plenty of structure, but it is linguistic structure intended for human consumption, not for computers. As data, text is also relatively dirty: people write ungrammatically, misspell words, run words together, abbreviate unpredictably, and punctuate randomly.
Why Text Is Important
In business, understanding customer feedback often requires understanding text.
Many legacy applications still produce or record text.
Fundamental concepts: The importance of constructing mining-friendly data representations; Representation of text for data mining.
Exemplary techniques: Bag of words representation; TFIDF calculation; N-grams; Stemming; Named entity extraction; Topic models