Ch. 10 Representing and Mining Text
text data
much important data is coded as text/ needs to be converted to a meaningful (feature-vector) form for mining
product reviews, customer feedback, email messages are all examples of "listening to the customer"
unstructured- doesn't have the sort of structure we expect from data: tables of records with fields having fixed meanings
text has linguistic structure- intended for human consumption, not for computers
dirty data- misspellings, grammatical errors
context is crucial- the same phrase can carry different meanings depending on context (e.g., the sentiment of a review)
document- one piece of text, no matter how large or small/ a sentence, a 100-page report, a YouTube comment
documents are composed of individual tokens or terms/ a collection of documents is called a corpus
text representation task- taking a set of documents and turning it into a familiar feature-vector form
bag of words approach- treat every document as just a collection of individual words, ignoring grammar, word order, and sentence structure/ treats every word as a potentially important keyword of the document
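A minimal sketch of the bag-of-words idea using only the standard library (the tokenizer here is deliberately crude; real systems use more careful tokenization):

```python
from collections import Counter

def bag_of_words(document):
    """Turn a document into a bag of words: lowercase tokens with counts,
    ignoring grammar, word order, and sentence structure."""
    tokens = document.lower().split()
    # Strip simple surrounding punctuation from each token (crude tokenization)
    tokens = [t.strip(".,!?;:\"'()") for t in tokens]
    return Counter(t for t in tokens if t)

doc = "The staff was friendly, and the room was clean."
print(bag_of_words(doc))
```

The resulting counts form the raw material for the frequency-based representations below.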
term frequency- use word counts (frequencies) in the document rather than just presence/absence/ term frequency interpretation- the importance of a term in a document should increase with the number of times that term occurs
term should not be too rare and not too common
inverse document frequency- measures the sparseness (rarity) of a term t across the corpus/ a common form: IDF(t) = 1 + log(total number of documents / number of documents containing t)
TFIDF(t, d) = TF(t, d) × IDF(t)/ the TFIDF value is specific to a single document d, while IDF(t) depends on the entire corpus
each document becomes a feature vector, and the corpus becomes a set of these feature vectors
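The TF and IDF pieces above can be sketched from scratch; this assumes raw counts for TF and the IDF form given above, while real libraries offer many smoothing and normalization variants:

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Compute a sparse TFIDF feature vector (dict) for each tokenized document.
    TF(t, d) = raw count of t in d; IDF(t) = 1 + log(N / doc_count(t))."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({t: tf[t] * (1 + math.log(n_docs / doc_freq[t]))
                        for t in tf})
    return vectors

corpus = [["jazz", "guitar", "jazz"],
          ["guitar", "rock"],
          ["jazz", "festival"]]
vecs = tfidf_vectors(corpus)
```

Note how "rock" (appearing in one document) scores higher than "guitar" (appearing in two), reflecting the rarity boost from IDF.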
n-gram sequences- include sequences of adjacent words as terms
e.g., representing each document using as features its individual words (unigrams), adjacent word pairs (bigrams), and word triples (trigrams)
disadvantage- greatly increases the size of the feature set
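A short sketch of extracting unigram, bigram, and trigram features; the underscore-joined naming is just one convention:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(tokens, max_n=3):
    """Represent a document by its unigrams, bigrams, and trigrams.
    The feature set grows quickly as max_n increases."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(tokens, n))
    return feats

tokens = "the quick brown fox".split()
print(ngram_features(tokens))
```

Even this four-word document yields nine features (4 unigrams + 3 bigrams + 2 trigrams), illustrating the feature-set blowup.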
named entity extraction- recognize multi-word proper names (e.g., "Silicon Valley") as single terms instead of individual words/ knowledge intensive- extractors must be trained on a large corpus or hand coded
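Real named entity extractors rely on trained models and curated name lists; as a toy illustration only, this crude heuristic groups runs of consecutive capitalized tokens into single multi-word terms:

```python
def capitalized_phrases(tokens):
    """Toy stand-in for named entity extraction: group consecutive
    capitalized tokens into single multi-word terms. Real extractors
    use trained models, not this simple heuristic."""
    phrases, current = [], []
    for tok in tokens:
        if tok[:1].isupper():
            current.append(tok)
        else:
            if len(current) >= 2:   # keep only multi-word candidates
                phrases.append(" ".join(current))
            current = []
    if len(current) >= 2:
        phrases.append(" ".join(current))
    return phrases

tokens = "She moved from New York City to silicon valley last May".split()
print(capitalized_phrases(tokens))
```

Note the heuristic finds "New York City" but misses the lowercase "silicon valley", showing why the real task is knowledge intensive.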
topic layer- adding an additional layer between the document and the model: terms map to topics, and topics feed the model/ topics can be learned with matrix factorization methods
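One way to sketch the matrix-factorization idea is a truncated SVD of a tiny document-term count matrix (the matrix, terms, and choice of k here are illustrative assumptions; NumPy is assumed available):

```python
import numpy as np

# Toy document-term count matrix: rows = documents, columns = terms.
# Assumed terms:     jazz  guitar  rock  stock  market
X = np.array([[2, 1, 0, 0, 0],
              [1, 2, 1, 0, 0],
              [0, 0, 0, 2, 1],
              [0, 0, 1, 1, 2]], dtype=float)

# Truncated SVD: X ~ U_k @ diag(s_k) @ Vt_k, keeping k latent "topics".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topics = U[:, :k] * s[:k]   # each document as a k-dim topic vector

def cos(a, b):
    """Cosine similarity between two topic vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Documents about the same theme (the two music documents) end up close in the low-dimensional topic space, even when they share few exact terms.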