Ch. 10 Representing and Mining Text
text data
data coded as text needs to be converted into a meaningful (feature-vector) form before mining
product reviews, customer feedback, and email messages are all examples of "listening to the customer"
unstructured- doesn't have the sort of structure we expect from data: tables of records with fields having fixed meanings
text has linguistic structure- intended for human consumption, not for computers
dirty data- misspellings and grammatical errors are common
context is crucial- the sentiment of a word or phrase can change depending on its context
document- one piece of text, however large or small: a sentence, a 100-page report, a YouTube comment
a document is composed of individual tokens or terms/ a collection of documents is called a corpus
text representation task- taking a set of documents and turning it into a familiar feature-vector form
bag of words approach- treat every document as just a collection of individual words, ignoring grammar, word order, and sentence structure/ treats every word as a potentially important keyword of the document
frequency- differentiates documents by how many times a word is used/ term frequency (TF) interpretation- the importance of a term in a document should increase with the number of times that term occurs
a term should be neither too rare nor too common
inverse document frequency (IDF)- measures the sparseness of a term t across the corpus; the rarer the term, the higher its IDF
the TFIDF value, TFIDF(t, d) = TF(t, d) × IDF(t), is specific to a single document d, whereas IDF depends on the entire corpus
each document becomes a feature vector, and the corpus becomes a set of these feature vectors
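A minimal sketch of the bag-of-words idea in plain Python; the whitespace tokenizer and the tiny example corpus are illustrative assumptions, not from the chapter:

```python
from collections import Counter

def bag_of_words(document):
    """Treat a document as a collection of lowercase words and count occurrences."""
    tokens = document.lower().split()   # naive whitespace tokenization
    return Counter(tokens)              # word -> count in this document

# illustrative corpus: three tiny "documents"
corpus = [
    "jazz music has a swing rhythm",
    "swing is hard to explain",
    "swing rhythm is the key to jazz",
]

for doc in corpus:
    print(bag_of_words(doc))
```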
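A hand-rolled sketch of TFIDF using a common formulation, IDF(t) = 1 + log(total documents / documents containing t), with raw counts as TF; the corpus and tokenizer are illustrative assumptions:

```python
import math
from collections import Counter

def tokenize(document):
    return document.lower().split()

corpus = [
    "jazz music has a swing rhythm",
    "swing is hard to explain",
    "swing rhythm is the key to jazz",
]
tokenized = [tokenize(d) for d in corpus]

def idf(term):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    doc_count = sum(1 for doc in tokenized if term in doc)
    return 1 + math.log(len(tokenized) / doc_count)

def tfidf(term, doc_tokens):
    """TFIDF(t, d) = TF(t, d) * IDF(t), with TF as the raw count in the document."""
    return Counter(doc_tokens)[term] * idf(term)

# 'swing' appears in every document (low IDF); 'jazz' in two of three (higher IDF)
for term in ("swing", "jazz"):
    print(term, [round(tfidf(term, doc), 3) for doc in tokenized])
```

Normalizing TF by document length is also common; the raw count is used here only to keep the sketch short.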
n-gram sequences- include sequences of adjacent words as terms
e.g., representing each document using as features its individual words, adjacent word pairs, and word triples (see the sketch after these notes)
disadvantage- greatly increases the size of the feature set
named entity extraction- recognizes multi-word proper names as single entities instead of separate words; knowledge intensive, since the extractor must be trained on a large corpus or hand coded
topic layer- adding an additional layer between the document and the model: terms map to topics and documents are modeled over those topics, e.g., via matrix factorization methods such as latent semantic indexing
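A sketch of building n-gram features (individual words plus adjacent pairs and triples); joining the words with underscores is an illustrative convention:

```python
def ngrams(tokens, n):
    """All sequences of n adjacent tokens, joined into single feature names."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(document, max_n=3):
    """Individual words, adjacent word pairs, and word triples as features."""
    tokens = document.lower().split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

print(ngram_features("the quick brown fox"))
# ['the', 'quick', 'brown', 'fox', 'the_quick', 'quick_brown', 'brown_fox',
#  'the_quick_brown', 'quick_brown_fox']
```

Even a four-word document yields nine features here, which illustrates how quickly the feature set grows.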
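Named entity extraction typically relies on a pretrained model, which is where the knowledge-intensive part lives. A minimal sketch with spaCy, assuming the en_core_web_sm model is installed; the example sentence is illustrative:

```python
import spacy

# requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Game of Thrones was filmed partly in Northern Ireland for HBO.")

# each entity spans one or more tokens and carries a type label (PERSON, ORG, GPE, ...)
for ent in doc.ents:
    print(ent.text, ent.label_)
```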
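A sketch of learning a topic layer by matrix factorization, here truncated SVD (the idea behind latent semantic indexing) applied to a TFIDF document-term matrix with scikit-learn; the corpus and the choice of two topics are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "jazz music has a swing rhythm",
    "swing is hard to explain",
    "stocks fell as markets reacted to interest rates",
    "interest rates and stocks move markets",
]

# terms -> TFIDF document-term matrix
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# factor the matrix into a small number of latent "topics"
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)   # each document expressed over the topics
print(doc_topics.round(2))

# top terms per topic, read off the factorization's components
terms = tfidf.get_feature_names_out()
for i, comp in enumerate(svd.components_):
    top = comp.argsort()[-3:][::-1]
    print(f"topic {i}:", [terms[j] for j in top])
```

The model then works with the small number of topic weights per document rather than the full, sparse term vector.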