Ch. 10 Representing and Mining Text
text data
much important data is coded as text/ needs to be converted to a meaningful (feature-vector) form for mining
product reviews, customer feedback, email messages are all examples of "listening to the customer"
unstructured- doesn't have the sort of structure we expect from data: tables of records with fields having fixed meanings
text has linguistic structure- intended for human consumption, not for computers
dirty data- misspellings, grammatical errors
context is crucial- the same phrase can carry different meanings depending on context (e.g., the sentiment of a review)
document- one piece of text, no matter how large or small/ a sentence, a 100-page report, a YouTube comment
documents are composed of individual tokens or terms/ a collection of documents is called a corpus
text representation task- taking a set of documents and turning it into a familiar feature-vector form
bag of words approach- treat every document as just a collection of individual words, ignoring grammar, word order, and sentence structure/ treats every word as a potentially important keyword of the document
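A minimal sketch of the bag-of-words idea using only the standard library (the tokenizer here is deliberately crude; real systems use more careful tokenization):

```python
from collections import Counter

def bag_of_words(document):
    """Turn a document into a bag of words: lowercase tokens with counts,
    ignoring grammar, word order, and sentence structure."""
    tokens = document.lower().split()
    # Strip simple surrounding punctuation from each token (crude tokenization)
    tokens = [t.strip(".,!?;:\"'()") for t in tokens]
    return Counter(t for t in tokens if t)

doc = "The staff was friendly, and the room was clean."
print(bag_of_words(doc))
```

The resulting counts form the raw material for the frequency-based representations below.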
term frequency- use word counts (frequencies) in the document rather than just presence/absence/ term frequency interpretation- the importance of a term in a document should increase with the number of times that term occurs
term should not be too rare and not too common
inverse document frequency- measures the sparseness (rarity) of a term t across the corpus/ a common form: IDF(t) = 1 + log(total number of documents / number of documents containing t)
TFIDF(t, d) = TF(t, d) × IDF(t)/ the TFIDF value is specific to a single document d, while IDF(t) depends on the entire corpus
each document becomes a feature vector, and the corpus becomes a set of these feature vectors
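The TF and IDF pieces above can be sketched from scratch; this assumes raw counts for TF and the IDF form given above, while real libraries offer many smoothing and normalization variants:

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Compute a sparse TFIDF feature vector (dict) for each tokenized document.
    TF(t, d) = raw count of t in d; IDF(t) = 1 + log(N / doc_count(t))."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({t: tf[t] * (1 + math.log(n_docs / doc_freq[t]))
                        for t in tf})
    return vectors

corpus = [["jazz", "guitar", "jazz"],
          ["guitar", "rock"],
          ["jazz", "festival"]]
vecs = tfidf_vectors(corpus)
```

Note how "rock" (appearing in one document) scores higher than "guitar" (appearing in two), reflecting the rarity boost from IDF.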
n-gram sequences- include sequences of adjacent words as terms
e.g., representing each document using as features its individual words (unigrams), adjacent word pairs (bigrams), and word triples (trigrams)
disadvantage- greatly increases the size of the feature set
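A short sketch of extracting unigram, bigram, and trigram features; the underscore-joined naming is just one convention:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(tokens, max_n=3):
    """Represent a document by its unigrams, bigrams, and trigrams.
    The feature set grows quickly as max_n increases."""
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(ngrams(tokens, n))
    return feats

tokens = "the quick brown fox".split()
print(ngram_features(tokens))
```

Even this four-word document yields nine features (4 unigrams + 3 bigrams + 2 trigrams), illustrating the feature-set blowup.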
named entity extraction- recognize multi-word proper names (e.g., "Silicon Valley") as single terms instead of individual words/ knowledge intensive- extractors must be trained on a large corpus or hand coded
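Real named entity extractors rely on trained models and curated name lists; as a toy illustration only, this crude heuristic groups runs of consecutive capitalized tokens into single multi-word terms:

```python
def capitalized_phrases(tokens):
    """Toy stand-in for named entity extraction: group consecutive
    capitalized tokens into single multi-word terms. Real extractors
    use trained models, not this simple heuristic."""
    phrases, current = [], []
    for tok in tokens:
        if tok[:1].isupper():
            current.append(tok)
        else:
            if len(current) >= 2:   # keep only multi-word candidates
                phrases.append(" ".join(current))
            current = []
    if len(current) >= 2:
        phrases.append(" ".join(current))
    return phrases

tokens = "She moved from New York City to silicon valley last May".split()
print(capitalized_phrases(tokens))
```

Note the heuristic finds "New York City" but misses the lowercase "silicon valley", showing why the real task is knowledge intensive.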
topic layer- adding an additional layer between the document and the model: terms map to topics, and topics feed the model/ topics can be learned with matrix factorization methods
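One way to sketch the matrix-factorization idea is a truncated SVD of a tiny document-term count matrix (the matrix, terms, and choice of k here are illustrative assumptions; NumPy is assumed available):

```python
import numpy as np

# Toy document-term count matrix: rows = documents, columns = terms.
# Assumed terms:     jazz  guitar  rock  stock  market
X = np.array([[2, 1, 0, 0, 0],
              [1, 2, 1, 0, 0],
              [0, 0, 0, 2, 1],
              [0, 0, 1, 1, 2]], dtype=float)

# Truncated SVD: X ~ U_k @ diag(s_k) @ Vt_k, keeping k latent "topics".
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topics = U[:, :k] * s[:k]   # each document as a k-dim topic vector

def cos(a, b):
    """Cosine similarity between two topic vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Documents about the same theme (the two music documents) end up close in the low-dimensional topic space, even when they share few exact terms.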