Representing and Mining Text
Why Important
Representation
Why Difficult
Bag of Words
Term frequency
Beyond bag of words
N-gram sequences
Named entities
Topic models
Text is the most valuable medium on the Internet
when people communicate with each other on the Internet, it is usually via text
understanding customer feedback often requires understanding text
text does not have the sort of structure that we normally expect for data
misspellings
nuance
As a result, context is very important
use the simplest (least expensive) technique that works
A document is one piece of text, no matter how large or small
could be a single sentence or a 100-page report, or anything in between
A document is composed of individual tokens or terms
think of a token or term as just a word
A collection of documents is called a corpus
the bag-of-words approach is to treat every document as just a collection of individual words
ignores grammar, word order, sentence structure, and (usually) punctuation
treats every word in a document as a potentially important keyword of the document
use the word count (frequency) in the document
frequency measures how prevalent a term is in a single document
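As a concrete illustration, here is a minimal sketch of a bag-of-words term-frequency representation using scikit-learn's CountVectorizer (this assumes scikit-learn is installed; the two example documents are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny made-up corpus: each string is one "document".
corpus = [
    "Data science is fun. Data is everywhere.",
    "Text mining turns raw text into data.",
]

# CountVectorizer lowercases, tokenizes, and counts term frequencies,
# ignoring grammar, word order, sentence structure, and punctuation.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())  # each row is a document, each column a term count
```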
a term should not be too rare: text processing systems usually impose a small (arbitrary) lower limit on the number of documents in which a term must occur
a term should not be too common, either
a term that occurs in nearly every document isn't useful for classification (it doesn't distinguish anything)
overly common terms are typically eliminated
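A sketch of how both limits might be applied in practice, again with scikit-learn (min_df and max_df are its parameters for exactly this; the thresholds and the corpus are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the market rallied today",
    "the market fell today",
    "the team won the game",
    "the team lost the game",
]

# min_df=2: a term must occur in at least 2 documents (not too rare);
# max_df=0.9: drop terms occurring in over 90% of documents (too common).
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
X = vectorizer.fit_transform(corpus)

# "the" appears in every document and is eliminated as overly common;
# terms appearing in only one document are eliminated as too rare.
print(vectorizer.get_feature_names_out())  # ['game' 'market' 'team' 'today']
```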
there are applications for which the bag-of-words representation isn't good enough
the next step up in complexity is to include sequences of adjacent words as terms (n-grams)
pairs of adjacent words are called bigrams
N-grams are useful when particular phrases are significant but their component words might not be
main disadvantage of n-grams is that they greatly increase the size of the feature set
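A sketch of the same vectorizer extended to bigrams (ngram_range is scikit-learn's parameter for this; the corpus is made up), which also shows how quickly the feature set grows:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the stock price fell sharply",
    "the bank raised the interest rate",
]

# ngram_range=(1, 2) keeps individual words plus pairs of adjacent
# words (bigrams) such as "interest rate".
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(corpus)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(corpus)

# Even on two short sentences the feature set doubles.
print(len(unigrams.get_feature_names_out()))  # 9
print(len(bigrams.get_feature_names_out()))   # 18
print(bigrams.get_feature_names_out())
```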
We want to be able to recognize common named entities in documents
sequences of words that, taken together, name unique entities with interesting identities
named entity extractors are knowledge intensive
they have to be trained on a large corpus, or hand-coded with extensive knowledge
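As one concrete possibility, a pre-trained extractor such as the one shipped with the spaCy library can be used off the shelf (this sketch assumes spaCy and its small English model en_core_web_sm are installed; the example sentence is made up):

```python
import spacy

# A pre-trained pipeline: the extractor was trained on a large
# annotated corpus, which is where the "knowledge" comes from.
nlp = spacy.load("en_core_web_sm")

doc = nlp("Google was founded in Menlo Park by Larry Page and Sergey Brin.")
for ent in doc.ents:
    # e.g., "Google" -> ORG, "Menlo Park" -> GPE, "Larry Page" -> PERSON
    print(ent.text, ent.label_)
```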
sometimes we want an additional layer between the document and the model
in the context of text, we call this the topic layer
model the set of topics in a corpus separately
General methods for creating topic models include matrix factorization methods, such as Latent Semantic Indexing, and probabilistic topic models, such as Latent Dirichlet Allocation
the terms associated with each topic, and any term weights, are learned by the topic-modeling process
the resulting topics are not necessarily intelligible to people
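A minimal sketch of the matrix-factorization flavor, using truncated SVD over a document-term matrix as a stand-in for Latent Semantic Indexing (scikit-learn is assumed; the corpus and the number of topics are made up):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "stocks and bonds and market returns",
    "the market rallied as stocks rose",
    "the team won the final game",
    "players scored in the last game",
]

# Build the document-term matrix, then factor it into a small
# number of latent "topics".
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

terms = tfidf.get_feature_names_out()
for i, topic in enumerate(svd.components_):
    top = topic.argsort()[::-1][:4]  # highest-weight terms for this topic
    print(f"topic {i}:", [terms[j] for j in top])
```

The top-weighted terms are what define each learned topic; as the last bullet notes, there is no guarantee that they form a humanly intelligible theme.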