REPRESENTING & MINING TEXT

TEXT DATA

it's everywhere

consumer complaint logs, product inquiries, and repair records

unstructured data

linguistic structure

"dirty" - different lengths, misspelled words

Document

one piece of text, no matter how large or small

composed of individual tokens or terms

Corpus - collection of documents

"Bag of Words"

to treat every document as just a collection of individual words

ignores grammar, word order, sentence structure

each word is a token

each token in a document is recorded as 1 (token is present in document) or 0 (token is not present)
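The binary bag-of-words representation above can be sketched in a few lines; the two-document corpus here is a made-up toy example:

```python
# Toy corpus (hypothetical) for a binary bag-of-words representation.
docs = ["the cat sat", "the dog barked at the cat"]

# Tokenize by whitespace; build a sorted vocabulary over all documents.
tokenized = [d.lower().split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Each document becomes a 0/1 vector: 1 if the token is present, 0 if not.
vectors = [[1 if term in doc else 0 for term in vocab] for doc in tokenized]
```

Note that the vector length equals the vocabulary size, not the document length, so all documents map to vectors of the same dimension.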

"Term Frequency"

use the word count (frequency) in the document instead of just 1 or 0

remove "stopwords" - common words in English

and, of, the

can impose upper and lower limits on term frequency
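Term frequency with stopword removal can be sketched as follows; the stopword list and document are illustrative, not a standard list:

```python
from collections import Counter

# Illustrative stopword list (real lists are much longer).
stopwords = {"and", "of", "the"}
doc = "the cat sat and the cat purred"

# Keep only non-stopword tokens, then count raw term frequency.
tokens = [t for t in doc.lower().split() if t not in stopwords]
tf = Counter(tokens)  # term -> count within this document
```

Upper and lower frequency limits would then be applied by filtering `tf` for counts inside the allowed range.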

the fewer documents in which a term occurs, the more significant it is likely to be to the documents in which it does occur

inverse document frequency (IDF)

Term Frequency + IDF = TFIDF

TFIDF(t, d) = TF(t, d) × IDF(t)

IDF(t) = 1 + log(total # of documents / # of documents containing t)

TFIDF is computed per term within a single document
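Putting the two pieces together, a minimal TFIDF sketch using the IDF formula above (the three-document corpus is invented for illustration):

```python
import math
from collections import Counter

# Hypothetical toy corpus.
corpus = [
    "the cat sat",
    "the dog sat",
    "the cat and the dog",
]
tokenized = [doc.lower().split() for doc in corpus]
N = len(tokenized)  # total number of documents

def idf(term):
    # IDF(t) = 1 + log(total # of documents / # of documents containing t)
    n_t = sum(term in doc for doc in tokenized)
    return 1 + math.log(N / n_t)

def tfidf(term, doc_tokens):
    # TFIDF(t, d) = TF(t, d) * IDF(t), using raw counts for TF
    return Counter(doc_tokens)[term] * idf(term)
```

A term like "the" that appears in every document gets IDF = 1 + log(1) = 1, the minimum; rarer terms score higher.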

N-Gram Sequences

include sequences of adjacent words as terms

Adjacent pairs are commonly called bi-grams

n-grams are useful when particular phrases are significant but their component words may not be
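Extracting adjacent pairs as terms is a one-liner; the phrase here is a made-up example of a bi-gram that means more than its parts:

```python
# Build n-gram terms from a token list (n=2 gives bi-grams).
def ngrams(tokens, n=2):
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york stock exchange".split()
bigrams = ngrams(tokens)  # ['new_york', 'york_stock', 'stock_exchange']
```

The joined bi-gram strings can then be added to the bag of words alongside the single-word tokens.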

"Named Entity Extraction"

process raw text and extract phrases annotated with terms like person or organization

knowledge-intensive
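Real named entity extraction relies on curated lists or trained models; as a very naive stand-in, a capitalization heuristic can surface candidate entity phrases (the regex and sentence are illustrative assumptions, not how production systems work):

```python
import re

# Naive heuristic: runs of two or more capitalized words are
# treated as candidate named entities. This misses lowercase
# entities and over-matches sentence-initial phrases.
def candidate_entities(text):
    return re.findall(r"(?:[A-Z][a-z]+)(?:\s[A-Z][a-z]+)+", text)

sentence = "Researchers at Stanford University met with John Smith."
```

This illustrates why the task is knowledge-intensive: deciding that "Stanford University" is an organization and "John Smith" a person requires information the raw text alone does not carry.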