Chapter 10: Representing and Mining Text
Importance of Text
It's everywhere
Most online content is text-oriented
Contains relevant information on the individual
Feedback supplied in text
Difficulty of Text
Unstructured data
Intended for humans, not computers
"Dirty" data
Ungrammatical sentences, misspellings, run-ons, etc.
Assigning value to text
positive connotation
negative connotation
Representation
Use simplest working technique
document
one piece of text
Tokens/Terms
individual words
corpus
Collection of documents
Bag of words
Technique treating document as collection of individual words
Term Frequency
How many times a term appears in a document
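A minimal sketch of bag of words with term frequency, using only the Python standard library. The `tokenize` and `term_frequencies` helpers and the sample sentence are illustrative assumptions, not part of the chapter; real pipelines often add stemming and stopword removal.

```python
import re
from collections import Counter

def tokenize(document):
    # Lowercase, then keep runs of letters, digits, and apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", document.lower())

def term_frequencies(document):
    # Bag of words: count each token, ignoring word order and grammar.
    return Counter(tokenize(document))

doc = "The cat sat on the mat. The mat was warm."  # hypothetical document
tf = term_frequencies(doc)
print(tf.most_common(2))  # [('the', 3), ('mat', 2)]
```

Raw counts favor common words such as "the", which is one reason to weight them further.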
TFIDF
Term frequency times inverse document frequency: a term's count in a document, weighted by how rare the term is across the corpus
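A rough sketch of TFIDF weighting over a tiny corpus. It assumes one common IDF variant, IDF(t) = 1 + log(N / n_t), where N is the number of documents and n_t is the number of documents containing term t; other variants exist. The function names and the three-document corpus are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(document):
    # Lowercase, then keep runs of letters, digits, and apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", document.lower())

def tfidf(corpus):
    # Raw term counts, one Counter per document.
    counts = [Counter(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    # Assumed IDF variant: 1 + log(N / n_t).
    idf = {t: 1.0 + math.log(n_docs / df) for t, df in doc_freq.items()}
    # TFIDF(t, d) = TF(t, d) * IDF(t)
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

corpus = [  # hypothetical documents
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply",
]
weights = tfidf(corpus)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {w:.3f}")
```

In the first document, "mat" appears in only one document of the corpus, so it receives a higher weight than "cat", which appears in two.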
Beyond Bag of Words
N-gram Sequences
Keep some word order (bag of words discards it)
adjacent words as terms
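A small sketch of n-gram extraction, assuming the text has already been split into tokens. The `ngrams` helper is a hypothetical name used here for illustration.

```python
def ngrams(tokens, n):
    # Slide a window of width n across the token list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
# ['the quick', 'quick brown', 'brown fox', 'fox jumps']

# "Bag of n-grams up to two": single words plus adjacent pairs as terms.
terms = tokens + bigrams
```

Adding n-grams grows the term set quickly, so larger values of n cost more storage and usually need more data.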
Named Entity Extraction
Increasing sophistication
Process raw text to extract annotated terms
Knowledge intensive, unlike bag of words and n-grams, which are based on segmenting on whitespace and punctuation
Topic models
Model the set of topics in a corpus separately
Topics are learned from the data
Identifying statistical regularities