CH. 10: Representing and Mining Text
Why it's important
It's everywhere
Needs to be converted into a meaningful form
User-generated content in Web 2.0 usually takes the form of text
Why it's difficult
It's unstructured (a computer cannot read it in a meaningful way)
It's "dirty": grammar errors, misspellings, incorrect punctuation, etc.
Context of the words is important
Representation
Text needs to be transformed into a data set that can be fed into a data mining algorithm
Document = one piece of text no matter the size
Token/Term = a single word (documents are made up of tokens)
Corpus = collection of documents
Ways to Represent Text
Bag of words
Treat every document as a collection of individual words
Term Frequency
Number of times a word occurs in a document
Importance of a term in a document increases with the number of times it is used there
Steps to make a table
Make every word lowercase
Remove suffixes (called stemming)
Remove stopwords (stopword = very common word like the, and, of, on, etc.)
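The steps above can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: the stopword list is a tiny made-up sample, and `naive_stem` is a hypothetical stand-in for a real stemmer (such as Porter's), so its stems will not always be real words.

```python
from collections import Counter

# Assumption: a tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "and", "of", "on", "a", "is", "in", "to"}

def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(document):
    """Lowercase, strip punctuation, drop stopwords, stem, then count terms."""
    tokens = [t.strip(".,!?;:\"'()") for t in document.lower().split()]
    terms = [naive_stem(t) for t in tokens if t and t not in STOPWORDS]
    return Counter(terms)

tf = term_frequencies("The cat chased the cats, and the dog jumped on the cat.")
print(tf)
```

Here "cat" and "cats" collapse into the same term, which is the point of stemming; the crude rule also turns "chased" into "chas", which a real stemmer would handle better.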
Inverse Document Frequency
Term should not be too rare
Set an arbitrary lower limit
Term should not be too common
Set an arbitrary upper limit
The fewer documents a term occurs in, the more significant it is likely to be to the documents it does occur in
IDF(t) = 1 + log(total number of documents/number of documents containing t)
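A small sketch of the IDF formula and the rare/common limits, under a few assumptions: each document is represented as a set of terms, the logarithm is natural (the notes do not state a base), and the `min_docs` and `max_fraction` thresholds are arbitrary illustrative values.

```python
import math

def document_count(term, corpus):
    """Number of documents in the corpus that contain the term."""
    return sum(1 for doc in corpus if term in doc)

def idf(term, corpus):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    containing = document_count(term, corpus)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(corpus) / containing)

def keep_term(term, corpus, min_docs=2, max_fraction=0.9):
    """Apply an arbitrary lower limit (too rare) and upper limit (too common)."""
    containing = document_count(term, corpus)
    return containing >= min_docs and containing / len(corpus) <= max_fraction

# Toy corpus: each document is a set of terms.
corpus = [
    {"jazz", "music", "piano"},
    {"music", "concert"},
    {"music", "guitar"},
    {"music", "jazz"},
]

for term in ("music", "jazz", "piano"):
    print(term, round(idf(term, corpus), 3), keep_term(term, corpus))
```

In this toy corpus "music" appears in every document, so it gets the minimum IDF of 1 and is filtered out as too common, while "piano" appears in only one document and is filtered out as too rare.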
Term Frequency and IDF
TFIDF(t, d) = TF(t, d) × IDF(t)
A TFIDF value is specific to a single document
IDF depends on entire corpus
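To make the document-versus-corpus distinction concrete, here is a minimal sketch that combines the two quantities. It assumes documents are already tokenized into lists of terms, uses raw counts for TF, and uses the natural logarithm for IDF; the corpus is made up for illustration.

```python
import math
from collections import Counter

def tfidf(term, document, corpus):
    """TFIDF(t, d) = TF(t, d) * IDF(t), with IDF(t) = 1 + log(N / n_t)."""
    tf = Counter(document)[term]                      # depends on this document
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0                                    # term absent from corpus
    idf = 1 + math.log(len(corpus) / containing)      # depends on whole corpus
    return tf * idf

# Toy corpus of tokenized documents.
corpus = [
    ["jazz", "music", "jazz", "piano"],
    ["music", "concert"],
    ["music", "guitar", "music"],
]

print(tfidf("jazz", corpus[0], corpus))   # rare term, used twice in document 0
print(tfidf("music", corpus[0], corpus))  # common term, used once in document 0
print(tfidf("music", corpus[2], corpus))  # same term, different document
```

The same term ("music") gets a different score in different documents because TF changes, while its IDF stays fixed for the whole corpus.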
N-Gram Sequences
Sequences of adjacent words
Adjacent pairs are called bi-grams
Disadvantage: greatly increases the size of the feature set
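A short sketch of how n-grams can be generated from a token list, and how quickly they enlarge the feature set. The helper name `ngrams` and the underscore joining convention are illustrative choices, not something prescribed by the notes.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()
bigrams = ngrams(tokens, 2)

# Bag of words plus bi-grams: the feature set nearly doubles for one sentence.
features = tokens + bigrams
print(bigrams)
print(len(tokens), len(features))
```

Five single words become nine features once bi-grams are added; over a whole corpus the number of distinct n-grams grows far faster than the vocabulary.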
Named Entity Extraction
Knowledge intensive
Knowledge has to be learned from a large corpus or coded by hand
Want to recognize common named entities, e.g., Game of Thrones
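The "coded by hand" option can be illustrated with a very simple lookup against a list of known names. This is only a sketch: the entity list is a hypothetical example, and naive substring matching is far cruder than what real named entity extractors do.

```python
# Assumption: a hand-coded list of known entities (hypothetical example data).
KNOWN_ENTITIES = ["Game of Thrones", "New York Mets"]

def extract_entities(text, entities=KNOWN_ENTITIES):
    """Return the known entities that appear in the text, ignoring case."""
    lowered = text.lower()
    return [entity for entity in entities if entity.lower() in lowered]

found = extract_entities("I watched game of thrones after the New York Mets game.")
print(found)
```

Treating "Game of Thrones" as one unit keeps it from being split into the unrelated words "game" and "thrones", but every name the extractor should recognize has to be supplied in advance, which is why the approach is knowledge intensive.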
Topic Models
Want an additional layer between the document and the model
Model the set of topics in a corpus separately
Words map to one or more topics
Final classifier is defined in terms of topics rather than words
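The idea of a topic layer can be sketched with a hand-written word-to-topic map. This is an assumption made purely for illustration: a real topic model learns the topics and the word-topic associations from the corpus, and the map, topic names, and function below are all hypothetical.

```python
from collections import Counter

# Assumption: a hand-written map; a real topic model would learn this.
WORD_TOPICS = {
    "guitar": ["music"],
    "concert": ["music"],
    "goal": ["sports"],
    "score": ["music", "sports"],  # a word may map to more than one topic
}

def topic_features(tokens, word_topics=WORD_TOPICS):
    """Convert a document's words into topic counts for a downstream classifier."""
    counts = Counter()
    for token in tokens:
        for topic in word_topics.get(token, []):
            counts[topic] += 1
    return counts

features = topic_features(["guitar", "score", "goal", "weather"])
print(features)
```

A classifier trained on these features sees only the topic counts, not the individual words, which is what "defined in terms of topics rather than words" means here.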