Ch.10: Representing and Mining Text
-
representation
First, some basic terminology. Most of this is borrowed from the field of Information Retrieval (IR). A document is one piece of text, no matter how large or small.
Bag of words
It is important to keep in mind the purpose of the text representation task. In essence, we are taking a set of documents, each of which is a relatively free-form sequence of words, and turning it into our familiar feature-vector form. Each document is one instance, but we don’t know in advance what the features will be.
As the name implies, the approach is to treat every document as just a collection of individual words. This approach ignores grammar, word order, sentence structure, and (usually) punctuation. It treats every word in a document as a potentially important keyword of the document.
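A minimal sketch in Python of the binary bag-of-words idea (the function name is illustrative, and a crude regex stands in for real tokenization):

    import re

    def bag_of_words(documents):
        # One binary feature per vocabulary word: 1 if the word occurs
        # in the document, 0 otherwise. Grammar and word order are ignored.
        tokenized = [re.findall(r"[a-z']+", doc.lower()) for doc in documents]
        vocabulary = sorted(set(w for tokens in tokenized for w in tokens))
        vectors = [[1 if w in set(tokens) else 0 for w in vocabulary]
                   for tokens in tokenized]
        return vocabulary, vectors

    docs = ["The cat sat on the mat.", "The dog chased the cat."]
    vocab, vecs = bag_of_words(docs)
    print(vocab)  # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
    print(vecs)   # [[1, 0, 0, 1, 1, 1, 1], [1, 1, 1, 0, 0, 0, 1]]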
Term Frequency
The next step up is to use the word count (frequency) in the document instead of just a zero or one.
This lets us distinguish how often a word is used; in some applications, the importance of a term in a document should increase with the number of times that term occurs.
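A quick sketch of raw term-frequency counting (standard library only; many variants normalize counts by document length instead):

    from collections import Counter

    tokens = "the cat sat on the mat".split()
    tf = Counter(tokens)  # raw count of each term in the document
    print(tf)  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})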
-
combining them: TFIDF
A very popular representation for text is the product of Term Frequency (TF) and Inverse Document Frequency (IDF), commonly referred to as TFIDF. The TFIDF value of a term t in a given document d is thus:
TFIDF(t, d) = TF(t, d) × IDF(t)

where one common definition of inverse document frequency is IDF(t) = 1 + log(total number of documents / number of documents containing t).
Note that the TFIDF value is specific to a single document (d) whereas IDF depends on the entire corpus.
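A sketch that combines the two, using the IDF variant given above (toy corpus; assumes the queried term occurs somewhere in the corpus):

    import math
    from collections import Counter

    def tfidf(term, doc_tokens, corpus):
        # TFIDF(t, d) = TF(t, d) * IDF(t), with raw counts for TF and
        # IDF(t) = 1 + log(N / number of documents containing t).
        tf = Counter(doc_tokens)[term]
        n_t = sum(1 for tokens in corpus if term in tokens)
        return tf * (1 + math.log(len(corpus) / n_t))

    corpus = [d.split() for d in
              ["the cat sat", "the dog barked", "the cat and the dog"]]
    print(tfidf("cat", corpus[0], corpus))  # ~1.41: rarer, so weighted up
    print(tfidf("the", corpus[0], corpus))  # 1.0: appears in every document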
The bag-of-words text representation approach treats every word in a document as an independent potential keyword (feature) of the document, then assigns values to each document based on frequency and rarity. TFIDF is a very common value representation for terms, but it is not necessarily optimal.
Beyond bag of words
N-gram Sequences
As presented, the bag-of-words representation treats every individual word as a term, discarding word order entirely. A next step up in complexity is to use sequences of adjacent words as terms: pairs are bigrams, triples are trigrams, and in general these are called n-grams.
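A minimal n-gram helper (illustrative; real pipelines typically use n-grams alongside the unigram features):

    def ngrams(tokens, n):
        # Adjacent word sequences of length n (bigrams for n=2, etc.).
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(ngrams("new york city is big".split(), 2))
    # ['new york', 'york city', 'city is', 'is big']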
-
Named entity extraction
Many text-processing toolkits include a named entity extractor of some sort. Usually these can process raw text and extract phrases annotated with terms like person or organization.
Unlike bag of words and n-grams, which are based on segmenting text on whitespace and punctuation, named entity extractors are knowledge-intensive: to recognize names well, they generally must be trained on a large corpus or hand-coded with knowledge of the entities.
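As one concrete example, a sketch using the open-source spaCy toolkit (assumes spaCy and its en_core_web_sm model are installed; other toolkits expose different APIs and label sets):

    import spacy

    # Setup (outside this script): pip install spacy
    #                              python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Tim Cook announced that Apple will open an office in London.")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # e.g., 'Tim Cook' PERSON, 'Apple' ORG, 'London' GPE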
Topic Models
The models described so far refer directly to words. Learning such direct models is relatively efficient, but it is not always optimal. Because of the complexity of language and documents, sometimes we want an additional layer between the document and the model; in the context of text we call this the topic layer.
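As an illustration, a sketch of one common topic-modeling technique, Latent Dirichlet Allocation, via scikit-learn (the toy documents are invented for the example):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["stocks fell as markets reacted to interest rates",
            "the team won the championship game last night",
            "investors bought bonds when stock markets dropped",
            "the coach praised the players after the game"]

    # Documents -> term counts -> a small number of latent topics.
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    # Each document is now a mixture over topics rather than raw words;
    # the model sits one layer above the words themselves.
    print(lda.transform(counts))  # one row per document; rows sum to 1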
-
why text is important
Medical records, consumer complaint logs, product inquiries, and repair records are still mostly intended as communication between people, not computers, so they’re still “coded” as text.
why text is difficult
Text is often referred to as “unstructured” data. This refers to the fact that text does not have the sort of structure that we normally expect for data: tables of records with fields having fixed meanings (essentially, collections of feature vectors), as well as links between the tables.
As data, text is relatively dirty. People write ungrammatically, misspell words, run words together, abbreviate unpredictably, and punctuate randomly.
-