Provost Chapter 10 - Representing and Mining Text
Why It's Important
Text is everywhere and difficult to exploit
Understanding customer feedback requires understanding text
Why It's Difficult
It's "unstructured"
Text doesn't have the fixed, readily interpretable structure that data in a graph or chart has
It's dirty
People don't write with proper grammar or spelling, and they use odd abbreviations, etc.
The Approach
Use the simplest technique possible
Bag of Words
Treat every word within a document individually
Ignore grammar, sentence structure, etc.
Term Frequency
Take all the words in a document and count them. First normalize the words: convert to lowercase, strip prefixes and suffixes (stemming), and remove stop words (very common words such as "the" and "and")
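The counting steps above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the stop-word list and the `naive_stem` helper are hypothetical stand-ins (a real pipeline would use a full stop-word list and a proper stemmer), and it only strips suffixes.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "it"}

def naive_stem(word):
    """Crude suffix stripping; an assumed stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(document):
    """Lowercase, tokenize, drop stop words, stem, then count each term."""
    tokens = re.findall(r"[a-z]+", document.lower())
    terms = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(terms)

tf = term_frequencies("Phones ring. The phone rang and the phones kept ringing.")
print(tf.most_common())
```

Here "Phones", "phone", and "phones" all collapse into the single term "phone", while "rang" stays separate from "ring" because simple suffix stripping does not handle irregular forms.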
Other Methods
N-gram Sequences
The consideration of sequences of adjacent words as terms
Easy to generate, require no linguistic knowledge, no complex parsing algorithm
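As a rough sketch of why n-grams are easy to generate, the function below slides a window of n words across a token list. The function name `ngrams` and the underscore used to join words are choices made for this example.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined into a single term."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()
bigrams = ngrams(tokens, 2)
print(bigrams)  # ['the_quick', 'quick_brown', 'brown_fox', 'fox_jumps']
```

Adding these terms to the bag of words keeps some word order, at the cost of a much larger set of terms.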
Named Entity Extraction
The recognition of common named things, e.g. Silicon Valley, New York City, etc.
Extractors have to be trained on a large corpus or hand coded with extensive knowledge
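A toy version of the hand-coded approach is a lookup against a list of known names. The sketch below assumes a hypothetical three-entry list, `KNOWN_ENTITIES`; a usable extractor would need a far larger list or a model trained on a corpus.

```python
import re

# Hypothetical hand-coded list; real systems need extensive knowledge here.
KNOWN_ENTITIES = ["Silicon Valley", "New York City", "New York"]

def extract_entities(text, entities=KNOWN_ENTITIES):
    """Return known entity names in the order they appear in the text."""
    # Try longer names first so "New York City" wins over "New York".
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(r"\b" + re.escape(e) + r"\b" for e in ordered))
    return pattern.findall(text)

found = extract_entities("She left New York City for a startup in Silicon Valley.")
print(found)
```

Each matched phrase can then be treated as a single term instead of being split into separate words.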