Representing and Mining Data
Data in an unstructured, hard-to-interpret form: text data
Why text is important
records, logs, etc
Twitter, FB, Reddit, Google
Understanding the customer, understanding the text
Why text is difficult
Unstructured data
Dirty data
Synonyms, homographs, spelling/grammar errors
Context is very important
Preprocessing is necessary
Representation
Information Retrieval
Document = one piece of text (no matter size)
Tokens or terms = individual words (sometimes phrases)
Corpus = collection of documents
Bag of words
Every document is just a collection of words
ignores all other structure/form/grammar
is the token present?
1/0
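The presence representation above can be sketched in a few lines of plain Python; the two-sentence corpus is an illustrative assumption, not from the notes:

```python
# Minimal sketch: binary bag-of-words (token presence) representation.
corpus = [
    "the iphone was announced today",
    "apple announced the new iphone",
]

# Build the vocabulary from the whole corpus.
vocab = sorted({token for doc in corpus for token in doc.split()})

# Each document becomes a 1/0 vector: is the token present?
vectors = [[1 if term in doc.split() else 0 for term in vocab] for doc in corpus]

for doc, vec in zip(corpus, vectors):
    print(doc, "->", vec)
```

Note that all word order and grammar is discarded; only presence in the vocabulary survives.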
Term frequency
count of use
normalized; all in lowercase
iphone|IPHONE|iPhone
Stemming: stripping suffixes/inflections from words
announces|announced|announcing
stopwords
the|and|of|on
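The three preprocessing steps above (lowercasing, stemming, stopword removal) can be sketched as a tiny pipeline. The stopword list and suffix rules below are crude illustrative assumptions, not a standard stemmer such as Porter's:

```python
# Minimal preprocessing sketch: normalization, stopword removal, stemming.
STOPWORDS = {"the", "and", "of", "on"}       # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")          # crude illustrative rules

def stem(token):
    # Strip the first matching suffix; a real stemmer uses many more rules.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = text.lower().split()                        # normalize case
    tokens = [t for t in tokens if t not in STOPWORDS]   # drop stopwords
    return [stem(t) for t in tokens]                     # reduce inflections

print(preprocess("Apple announces the iPhone and iPad"))
# → ['apple', 'announc', 'iphone', 'ipad']
```

This maps "announces"/"announced"/"announcing" to the same stem, at the cost of producing non-words like "announc".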
Measuring Sparseness|Inverse Document Frequency
Terms should not be too rare or too common
IDF(t) = 1 + log (total number of documents / number of documents containing t)
TFIDF(t, d) = TF(t, d) × IDF(t)
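The two formulas above translate directly into code; the tiny jazz-themed corpus is an illustrative assumption:

```python
import math

# Minimal sketch of the formulas above:
#   IDF(t)      = 1 + log(N / n_t)
#   TFIDF(t, d) = TF(t, d) * IDF(t)
corpus = [
    ["jazz", "musician", "plays", "jazz"],
    ["famous", "jazz", "club"],
    ["famous", "musician"],
]
N = len(corpus)

def idf(term):
    n_t = sum(1 for doc in corpus if term in doc)  # docs containing term
    return 1 + math.log(N / n_t)

def tfidf(term, doc):
    return doc.count(term) * idf(term)             # TF is a raw count here

# "jazz" occurs twice in document 0 and in 2 of the 3 documents.
print(round(tfidf("jazz", corpus[0]), 3))
```

A term in every document gets the minimum IDF of 1; a rare term is boosted.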
Feature selection
with normalization
Jazz musicians
Small corpus of 15 documents
stemming can strip letters that are actually needed
e.g. Kansas → "kansa", famous → "famou"
Nearest neighbors/similarities with words
IDF and Entropy
Similar in spirit to entropy, but IDF is not an expected-value calculation
Beyond bag of words
N Gram sequences
Sequences of adjacent terms
commonly paired words joined with _ (e.g. bag_of_words)
N-grams up to three
singles, doubles, and triples
they greatly increase the size of the feature set
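Extracting singles, doubles, and triples from a token sequence can be sketched as follows; the sample sentence is an illustrative assumption:

```python
# Minimal sketch: n-grams up to trigrams, adjacent words joined with "_".
def ngrams(tokens, max_n=3):
    grams = []
    for n in range(1, max_n + 1):                  # 1-, 2-, then 3-grams
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i : i + n]))
    return grams

tokens = "the quick brown fox".split()
print(ngrams(tokens))
```

Four tokens already yield nine features (4 unigrams + 3 bigrams + 2 trigrams), which shows how quickly the feature set grows.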
Named entity extraction
Preprocessing components
HP, H-P, Hewlett Packard
Knowledge intensive
must be learned or coded by hand
Oakland Raiders
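Since entity knowledge "must be learned or coded by hand", a hand-coded dictionary lookup is the simplest sketch. The alias table and matching-by-substring approach below are illustrative assumptions, not a real extractor:

```python
# Minimal sketch: dictionary-based named entity extraction with a
# hand-coded alias table mapping surface forms to canonical names.
ENTITIES = {
    "hp": "Hewlett-Packard",
    "h-p": "Hewlett-Packard",
    "hewlett packard": "Hewlett-Packard",
    "oakland raiders": "Oakland Raiders",
}

def extract_entities(text):
    text = text.lower()
    # Naive substring matching; real systems tokenize and disambiguate.
    return [canon for alias, canon in ENTITIES.items() if alias in text]

print(extract_entities("HP and the Oakland Raiders were mentioned"))
```

The point of the sketch is that "HP", "H-P", and "Hewlett Packard" all resolve to one canonical entity, which bag-of-words alone cannot do.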
Topic model
Topic layer
model sets of topics independently
Words map to one or more topics
Latent semantic indexing
Probabilistic topic modeling