Chapter 10: Representing and Mining Text
Text Is Important
Very common
from sources like
Applications
Medical Records
Product Inquiries
Consumer Complaint Logs
Repair Records
from Internet
Google, Bing, etc.
Reddit, Facebook, Twitter, etc.
Growing internet
means growing data
Very Difficult
unstructured
for computers
often has linguistic structure
Easy for humans
Dirty Data
Representation
Stems from Information Retrieval field
Types
Corpus
a collection of
Documents
Made up of
tokens
terms or words
No matter the size
One piece of text
Difficulties
Unstructured
"Bag of Words"
Steps
Find Frequency
using word count
to calculate "Term Frequency"
shows
differences in usage
among all tokens in a document
want to know
which terms are
rare
common
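The word-count step above can be sketched as follows (a minimal sketch assuming whitespace tokenization; the function name is illustrative):

```python
from collections import Counter

def term_frequency(document):
    """Count how often each token appears in one document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # Normalize raw counts by document length so long documents
    # do not dominate purely by having more words.
    return {term: count / total for term, count in counts.items()}

tf = term_frequency("the quick brown fox jumps over the lazy dog")
# tf["the"] == 2/9 (two of nine tokens)
```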
Inverse Document Frequency
shows usage of terms
to identify
rare terms
common terms
Decreases as
words are more common
Increases as
words are more rare
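One common way to compute IDF, sketched below; the exact weighting scheme varies by library, so treat this form as one illustrative choice:

```python
import math

def inverse_document_frequency(term, corpus):
    """IDF: large for rare terms, small for common ones."""
    docs_with_term = sum(
        1 for doc in corpus if term in doc.lower().split())
    if docs_with_term == 0:
        return 0.0  # term never appears; no useful weight
    # 1 + log(N / df): decreases toward 1 as the term appears
    # in more documents, increases as it becomes rarer.
    return 1 + math.log(len(corpus) / docs_with_term)
```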
TFIDF
combines
Term Frequency
Inverse Document Frequency
converts
documents into
feature vectors
corpus into
set of feature vectors
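Combining the two scores turns each document into a feature vector over the corpus vocabulary. A sketch, again assuming simple whitespace tokenization and the 1 + log weighting above:

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Convert a corpus into a set of TFIDF feature vectors."""
    tokenized = [doc.lower().split() for doc in corpus]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # TF (normalized count) times IDF, one entry per vocab term.
        vec = [(counts[t] / len(doc)) * (1 + math.log(n_docs / df[t]))
               for t in vocab]
        vectors.append(vec)
    return vocab, vectors
```

Each document becomes one row; terms the document lacks get a weight of zero.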
Beyond Bag of Words
bag of words
is simple because it
does not account for
linguistic context
For complex documents
N-Gram Sequences
includes
word sequencing
Ex:
"The quick brown fox"
"quick brown", "brown fox", etc.
Named Entity Extraction
recognizes
named entities
in a document
difficult because
extraction often struggles with
understanding context
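Real named-entity extractors need context. A crude capitalization heuristic, purely illustrative, and exactly the kind of approach that struggles with ambiguity:

```python
def naive_entities(text):
    """Flag capitalized tokens that are not sentence-initial.

    Crude heuristic only: it misses lowercase entities and
    mislabels any mid-sentence capitalized word.
    """
    tokens = text.split()
    return [t.strip(".,") for i, t in enumerate(tokens)
            if i > 0 and t[:1].isupper()]

naive_entities("Yesterday Alice flew to Paris with Bob.")
# ["Alice", "Paris", "Bob"]
```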
Topic Models
includes
Topic Layer
models the set of topics
of documents in the corpus
as a separate layer
helpful for
search engines
because even if keywords don't match
results can still come from a relevant topic