Chapter 10: Representing and Mining Text
Text is just another form of data
Text processing is just a special case of representation engineering
Importance
Text is everywhere
Legacy applications
Social media
Medical records
Old media
Business customer data
Difficult
Unstructured data
Not the sort of structure we normally expect from data
Has structure, but it is linguistic structure
Intended for human consumption, not for computers
Dirty
Not always flawless (spelling and grammatical errors)
Even when flawless, it may still pose problems (synonyms, homographs, inconsistent terminology)
Context is important
Positive? Negative?
Representation
Text mining
Use the simplest (least expensive) technique that works
Information retrieval
A document is one piece of text, no matter how large or small
Tokens or terms
The individual words a document is composed of
Corpus
Collection of documents
Purpose
Take a set of documents and turn each one into our familiar feature-vector form
Approaches
Bag of words
Treats each document as a collection of individual words
Ignores grammar, word order, sentence structure, and punctuation
Treats every word in a document as a potentially important keyword
Set = only one instance of each item
Bag = multiset
Good and simple, but sometimes not good enough
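A minimal sketch of the bag-of-words idea in Python; the toy documents and the crude regex tokenizer are illustrative assumptions, not the book's code:

```python
import re

def tokenize(text):
    # Crude tokenizer: lowercase, keep only runs of letters.
    return re.findall(r"[a-z]+", text.lower())

docs = ["Jazz is music.", "Text is just another form of data, and data is everywhere."]

# Vocabulary: every distinct term seen anywhere in the corpus.
vocab = sorted({t for d in docs for t in tokenize(d)})

# Set-style representation: 1 if the term occurs in the document, else 0.
for d in docs:
    present = set(tokenize(d))
    print([1 if t in present else 0 for t in vocab])
```

The bag (multiset) version replaces these 0/1 values with counts, which is exactly the term-frequency step that follows.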
Term frequency
Next step: term frequency (the word count in the document)
Instead of just zero or one (absent or present)
A word's importance should increase with the number of times it is used
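A sketch of the same representation with counts instead of 0/1 values; collections.Counter does the tallying, and the sample sentence is made up:

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Term frequency: how many times each word occurs in one document.
tf = Counter(tokenize("Data is data, and more data means more evidence."))
print(tf.most_common(3))  # e.g. [('data', 3), ('more', 2), ('is', 1)]
```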
Measuring Sparseness: Inverse Document Frequency
Weight of a term
Based on how common the term is in the entire corpus
A term should not be too rare
A term should not be too common
Inverse document frequency of a term t: IDF(t) = 1 + log( Total number of documents / Number of documents containing t )
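Worked example (assuming the natural log, a common but unstated choice): in a 100-document corpus, a term found in 10 documents gets IDF = 1 + log(100/10) ≈ 3.30, while a term found in all 100 gets IDF = 1 + log(1) = 1, the minimum.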
Combining them: TFIDF
Combine term frequency and inverse document frequency
Multiply them together: TFIDF(t, d) = TF(t, d) × IDF(t)
Each document becomes a feature vector
A valuable general-purpose representation, but not necessarily optimal for any particular task
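A self-contained sketch putting the pieces together with the IDF definition above; the three-document corpus and the natural-log base are assumptions:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

docs = ["jazz is music", "rock is music", "jazz jazz everywhere"]
tokenized = [tokenize(d) for d in docs]

# IDF(t) = 1 + log(total documents / documents containing t)
n = len(docs)
doc_freq = Counter(t for toks in tokenized for t in set(toks))
idf = {t: 1 + math.log(n / df) for t, df in doc_freq.items()}

# Each document becomes a feature vector of TF * IDF values.
vocab = sorted(idf)
for toks in tokenized:
    tf = Counter(toks)
    print([round(tf[t] * idf[t], 2) for t in vocab])
```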
N-gram sequences
Use sequences of adjacent words as terms
Adjacent word pairs are called bi-grams
Useful when particular phrases are significant but their component words are not
Easy to generate
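Bi-grams really are easy to generate, as this short sketch shows (the sample phrase is illustrative):

```python
def ngrams(tokens, n):
    # Slide a window of width n across the token list.
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("exceeded our wildest expectations".split(), 2))
# ['exceeded_our', 'our_wildest', 'wildest_expectations']
```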
Special considerations
Named entity extraction
Knowledge intensive
Extractors have to be trained on a large corpus
or hand-coded
Quality of entity recognition can vary
Some extractors have particular areas of expertise
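As an illustration of an off-the-shelf extractor, this sketch uses spaCy's small English model (an assumed third-party dependency, installed separately; not something the chapter prescribes):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Game of Thrones premiered on HBO in New York in 2011.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. HBO ORG, New York GPE, 2011 DATE
```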
Topic models
Additional layer between the document and the model
Topic layer
Matrix factorization
Latent semantic indexing
Probabilistic topic models
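A minimal latent semantic indexing sketch using scikit-learn (an assumed dependency): factor the TFIDF document-term matrix so each document is re-expressed over a small number of latent topics; the corpus and topic count are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "jazz music concert tonight",
    "rock music guitar solo",
    "stock market trading update",
    "market prices rise again",
]

# TFIDF document-term matrix, then a rank-2 factorization (2 latent topics).
tfidf = TfidfVectorizer().fit_transform(docs)
topics = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(topics)  # one row per document: its weights on the 2 topics
```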
Most forms of data require some representation engineering
Engineer the data to match existing tools
Text, images, sound, and video require special processing
Turning text into feature vectors
Common approach: break the text into bags of words
and assign values with TFIDF
This approach is simple and inexpensive to apply