Provost Ch 10
Bag of Words
The approach is to treat every document as just a collection of individual words. This approach ignores grammar, word order, sentence structure, and (usually) punctuation. It treats every word in a document as a potentially important keyword of the document.
In the most basic approach, each word is a token, and each token is represented in a document by a one (if the token is present in the document) or a zero (if the token is not present).
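A minimal sketch of this binary representation using scikit-learn's CountVectorizer with binary=True (the two-document corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# binary=True records token presence (1) or absence (0), not counts.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```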
Term Frequency
The next step up is to use the word count (frequency) in the document instead of just a zero or one.
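A quick sketch of the count-based version using Python's built-in Counter (the document text is invented):

```python
from collections import Counter

doc = "the cat sat on the mat and the cat slept"

# Term frequency: how many times each token appears in the document.
counts = Counter(doc.split())
print(counts)  # Counter({'the': 3, 'cat': 2, ...})
```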
Many words have been stemmed: their suffixes removed, so that verbs like announces, announced, and announcing are all reduced to the term announc. Stemming also transforms noun plurals to their singular forms.
Finally, stopwords have been removed. A stopword is a very common word in English (or whatever language is being parsed). The words the, and, of, and on are considered stopwords in English.
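A sketch of both normalization steps using NLTK's Porter stemmer and its English stopword list (the sample sentence is invented, and NLTK is just one of several toolkits that offer this):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

tokens = "the company announces and announced new products".split()

# Drop stopwords, then stem what remains.
processed = [stemmer.stem(t) for t in tokens if t not in stop]
print(processed)  # ['compani', 'announc', 'announc', 'new', 'product']
```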
TFIDF
A very popular representation for text is the product of Term Frequency (TF) and Inverse Document Frequency (IDF)
TFIDF(t, d) = TF(t, d) × IDF(t), where IDF(t) = 1 + log(Total number of documents / Number of documents containing t)
The bag-of-words text representation approach treats every word in a document as an independent potential keyword (feature) of the document, then assigns values to each document based on frequency and rarity. TFIDF is a very common value representation for terms, but it is not necessarily optimal. If someone describes mining a text corpus using bag of words, it just means they're treating each word individually as a feature. Their values could be binary, term frequency, or TFIDF, with normalization or without.
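A sketch of TFIDF weighting with scikit-learn's TfidfVectorizer (the tiny corpus is invented; note that scikit-learn uses a smoothed variant of the IDF formula above, so exact values differ slightly):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "jazz music festival in the park",
    "rock music concert downtown",
    "city council meeting in the park",
]

# Each document becomes a vector of TF x IDF weights:
# frequent-in-this-document but rare-across-the-corpus terms score highest.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```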
Why Text Is Difficult
Words can have varying lengths and text fields can have varying numbers of words. Sometimes word order matters, sometimes not.
Context is important, much more so than with other forms of data.
Representation
A document is one piece of text, no matter how large or small. A document could be a single sentence or a 100-page report, or anything in between.
A document is composed of individual tokens or terms. For now, think of a token or term as just a word.
Beyond Bag of Words
N-gram Sequences
In some cases, word order is important and you want to preserve some information about it in the representation.
Words can be grouped in "bags" of any number: n-grams are just adjacent word pairs, triplets, and so on.
An advantage of using n-grams is that they are easy to generate; they require no linguistic knowledge or complex parsing algorithm.
The main disadvantage of n-grams is that they greatly increase the size of the feature set. There are far more word pairs than individual words, and still more word triples. The number of features generated can quickly get out of hand.
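N-gram generation really is that simple; a sketch in plain Python (the function name and example sentence are invented):

```python
def ngrams(tokens, n):
    """Return all n-grams (as strings) from a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))  # ['the quick', 'quick brown', 'brown fox']
print(ngrams(tokens, 3))  # ['the quick brown', 'quick brown fox']
```

The feature-set blowup is visible even here: four tokens yield three bigrams and two trigrams per sentence, and over a real vocabulary the number of distinct n-grams grows far faster than the number of distinct words.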
Named Entity Extraction
Sometimes we want still more sophistication in phrase extraction. We want to be able to recognize common named entities in documents.
Many text-processing toolkits include a named entity extractor of some sort. Usually these can process raw text and extract phrases annotated with terms like person or organization.
Unlike bag of words and n-grams, which are based on segmenting text on whitespace and punctuation, named entity extractors are knowledge intensive. To work well, they have to be trained on a large corpus, or hand-coded with extensive knowledge of such names.
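One such toolkit is spaCy; a minimal sketch (the sentence is invented, and the extracted labels depend on the trained model):

```python
import spacy

# Assumes the small English model has been installed with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple hired Tim Cook in Cupertino.")
for ent in doc.ents:
    # Prints each entity phrase with its label,
    # e.g. "Apple ORG", "Tim Cook PERSON", "Cupertino GPE".
    print(ent.text, ent.label_)
```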
Topic Models
As before, each document constitutes a sequence of words, but instead of the words being used directly by the final classifier, the words map to one or more topics.
Topics are learned from the data (often via unsupervised data mining). The final classifier is defined in terms of these intermediate topics rather than words.
One advantage is that in a search engine, for example, a query can use terms that do not exactly match the specific words of a document; if they map to the correct topic(s), the document will still be considered relevant to the search.
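One common unsupervised topic model is Latent Dirichlet Allocation; a sketch with scikit-learn (the four-document corpus is invented, and real topic models need far more data to learn coherent topics):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stocks fell as markets reacted to interest rates",
    "the team won the championship game last night",
    "investors worry about inflation and the economy",
    "the coach praised the players after the match",
]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)

# Learn 2 topics in an unsupervised fashion; each document
# becomes a mixture of topic weights rather than raw words.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

terms = counts.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[-4:][::-1]
    print(f"topic {k}:", [terms[i] for i in top])
```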