REPRESENTING & MINING TEXT
TEXT DATA
it's everywhere
consumer complaint logs, product inquiries, and repair records
unstructured data
has linguistic structure, but that structure is intended for human consumption, not for computers
"dirty" - different lengths, misspelled words
Document
one piece of text, no matter how large or small
composed of individual tokens or terms
Corpus - collection of documents
"Bag of Words"
to treat every document as just a collection of individual words
ignores grammar, word order, sentence structure
each word is a token
each document represented by a 1 (if the token is present in the document) or a 0 (if it is not)
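A minimal sketch of this binary bag-of-words representation, using a small invented corpus and no external library:

```python
# Each document becomes a vector of 1s and 0s over the corpus vocabulary.
corpus = [
    "the product arrived broken and the box was damaged",
    "great product fast shipping",
]

# Vocabulary: every distinct word across all documents is a token.
vocabulary = sorted({word for doc in corpus for word in doc.lower().split()})

# 1 if the token is present in the document, 0 if it is not.
binary_vectors = [
    [1 if token in doc.lower().split() else 0 for token in vocabulary]
    for doc in corpus
]

for doc, vec in zip(corpus, binary_vectors):
    print(vec, "<-", doc)
```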
"Term Frequency"
use the word count (frequency) in the document instead of just 1 or 0
remove "stopwords" - common word in english
and, of, the
can impose upper and lower limits on term frequency
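A sketch of the term-frequency variant using scikit-learn's CountVectorizer (library assumed available, version 1.0 or later for get_feature_names_out); the corpus is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the printer jams and the tray sticks",
    "the printer prints blank pages",
    "tray sticks again after repair",
]

vectorizer = CountVectorizer(
    stop_words="english",  # drop common English words such as "and", "of", "the"
    min_df=1,              # lower limit: keep terms appearing in at least 1 document
    max_df=0.9,            # upper limit: drop terms appearing in over 90% of documents
)
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(counts.toarray())  # rows are documents, columns are term counts
```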
the fewer documents in which a term occurs, the more significant it is likely to be to the documents it does occur in
inverse document frequency (IDF)
IDF(t) = 1 + log (total # of documents / # of documents containing t)
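A worked example of the IDF formula above, assuming a natural log and invented counts:

```python
import math

total_documents = 100            # size of the corpus
documents_containing_term = 4    # the term is rare

idf = 1 + math.log(total_documents / documents_containing_term)
print(round(idf, 2))  # 4.22 -- the rarer the term, the higher its IDF
```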
Term Frequency combined with IDF = TFIDF
TFIDF(t,d) = TF(t,d) × IDF(t)
value is specific to a single document
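A sketch combining the two pieces, TFIDF(t,d) = TF(t,d) × IDF(t), computed per term within a single document; corpus and counts are invented:

```python
import math

docs = [doc.split() for doc in [
    "printer jams printer tray",
    "printer prints blank pages",
    "tray sticks after repair",
]]
n_docs = len(docs)

def idf(term):
    df = sum(1 for doc in docs if term in doc)   # documents containing the term
    return 1 + math.log(n_docs / df)

def tfidf(term, doc):
    tf = doc.count(term)                         # frequency within this one document
    return tf * idf(term)

print(tfidf("printer", docs[0]))  # high TF in doc 0, but "printer" is common in the corpus
print(tfidf("jams", docs[0]))     # appears once, but rare in the corpus, so higher IDF
```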
N-Gram Sequences
include sequences of adjacent words as terms
Adjacent pairs are commonly called bi-grams
useful when particular phrases are significant but their component words may not be
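A sketch of bi-gram terms using scikit-learn's ngram_range option (library assumed available); the example sentence is invented:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the customer reported a dead battery after the repair"]

vectorizer = CountVectorizer(ngram_range=(1, 2))  # single words plus adjacent pairs
vectorizer.fit(corpus)

# Phrases such as "dead battery" now become terms of their own.
print(vectorizer.get_feature_names_out())
```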
"Named Entity Extraction"
process raw text and extract phrases annotated with terms like person or organization
knowledge intensive
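A sketch of named entity extraction with spaCy (library and its small English model assumed to be installed); the sentence is invented, and the pretrained model supplies the linguistic knowledge this approach depends on:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a repair center in Austin, according to Tim Cook.")

for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, Austin GPE, Tim Cook PERSON
```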