Representing and Mining Text
Why Important
Representation
Why Difficult
Bag of Words
Term frequency
Beyond bag of words
N-gram sequences
Named entities
Topic models
Text is the most valuable medium on the Internet
when people communicate with each other on the Internet, it is usually via text
understanding customer feedback often requires understanding text
text does not have the sort of structure that we normally expect for data
misspellings
nuance
As a result, context is very important
use the simplest (least expensive) technique that works
A document is one piece of text, no matter how large or small
could be a single sentence or a 100-page report, or anything in between
A document is composed of individual tokens or terms
think of a token or term as just a word
A collection of documents is called a corpus
the bag-of-words approach is to treat every document as just a collection of individual words
ignores grammar, word order, sentence structure, and (usually) punctuation
treats every word in a document as a potentially important keyword of the document
use the word count (frequency) in the document
frequency measures how prevalent a term is in a single document
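As a concrete illustration, here is a minimal sketch of a bag-of-words term-frequency representation using scikit-learn's CountVectorizer (this assumes scikit-learn is installed; the two example documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny made-up corpus: each string is one "document".
corpus = [
    "Data science is fun. Data is everywhere.",
    "Text mining turns raw text into data.",
]

# CountVectorizer lowercases, tokenizes, and counts term frequencies,
# ignoring grammar, word order, sentence structure, and punctuation.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())  # each row is a document, each column a term count
```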
a term should not be too rare: text processing systems usually impose a small (arbitrary) lower limit on the number of documents in which a term must occur
a term should not be too common, either
a term that occurs in nearly every document isn't useful for classification (it doesn't distinguish anything)
overly common terms are typically eliminated
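A sketch of how both limits might be applied in practice, again with scikit-learn (min_df and max_df are its parameters for exactly this; the thresholds and the corpus are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the market rallied today",
    "the market fell today",
    "the team won the game",
    "the team lost the game",
]

# min_df=2: a term must occur in at least 2 documents (not too rare);
# max_df=0.9: drop terms occurring in over 90% of documents (too common).
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
X = vectorizer.fit_transform(corpus)

# "the" appears in every document and is eliminated as overly common;
# terms appearing in only one document are eliminated as too rare.
print(vectorizer.get_feature_names_out())  # ['game' 'market' 'team' 'today']
```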
there are applications for which the bag-of-words representation isn't good enough
the next step up in complexity is to include sequences of adjacent words as terms (n-grams)
pairs of adjacent words are called bigrams
N-grams are useful when particular phrases are significant but their component words might not be
main disadvantage of n-grams is that they greatly increase the size of the feature set
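A sketch of the same vectorizer extended to bigrams (ngram_range is scikit-learn's parameter for this; the corpus is made up), which also shows how quickly the feature set grows:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the stock price fell sharply",
    "the bank raised the interest rate",
]

# ngram_range=(1, 2) keeps individual words plus pairs of adjacent
# words (bigrams) such as "interest rate".
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(corpus)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(corpus)

# Even on two short sentences the feature set doubles.
print(len(unigrams.get_feature_names_out()))  # 9
print(len(bigrams.get_feature_names_out()))   # 18
print(bigrams.get_feature_names_out())
```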
We want to be able to recognize common named entities in documents
sequences of words that, taken together, name unique entities with interesting identities
named entity extractors are knowledge intensive
they have to be trained on a large corpus, or hand-coded with extensive knowledge
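As one concrete possibility, a pre-trained extractor such as the one shipped with the spaCy library can be used off the shelf (this sketch assumes spaCy and its small English model en_core_web_sm are installed; the example sentence is made up):

```python
import spacy

# A pre-trained pipeline: the extractor was trained on a large
# annotated corpus, which is where the "knowledge" comes from.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Google was founded in Menlo Park by Larry Page and Sergey Brin.")
for ent in doc.ents:
    # e.g., "Google" -> ORG, "Menlo Park" -> GPE, "Larry Page" -> PERSON
    print(ent.text, ent.label_)
```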
sometimes we want an additional layer between the document and the model
in the context of text, we call this the topic layer
model the set of topics in a corpus separately
General methods for creating topic models include matrix factorization methods, such as Latent Semantic Indexing, and probabilistic topic models, such as Latent Dirichlet Allocation
the terms associated with each topic, and any term weights, are learned by the topic-modeling process
the resulting topics are not necessarily intelligible to people
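A minimal sketch of the matrix-factorization flavor, using truncated SVD over a document-term matrix as a stand-in for Latent Semantic Indexing (scikit-learn is assumed; the corpus and the number of topics are made up):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "stocks and bonds and market returns",
    "the market rallied as stocks rose",
    "the team won the final game",
    "players scored in the last game",
]

# Build the document-term matrix, then factor it into a small
# number of latent "topics".
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

terms = tfidf.get_feature_names_out()
for i, topic in enumerate(svd.components_):
    top = topic.argsort()[::-1][:4]  # highest-weight terms for this topic
    print(f"topic {i}:", [terms[j] for j in top])
```

The top-weighted terms are what define each learned topic; as the last bullet notes, there is no guarantee that they form a humanly intelligible theme.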