Provost Chapter 10
Representation
Measuring Sparseness
When deciding the weight of a term, we may also care how common it is across the entire corpus we're mining
A useful term is neither too rare nor too common
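A minimal sketch of the "not too rare, not too common" idea, assuming whitespace tokenization and two illustrative thresholds (`min_docs` and `max_fraction` are hypothetical names and values, not taken from the chapter):

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        # set() so a term is counted once per document, not once per occurrence
        df.update(set(doc.lower().split()))
    return df

def filter_terms(docs, min_docs=2, max_fraction=0.8):
    """Keep terms that appear in at least min_docs documents (not too rare)
    and in at most max_fraction of all documents (not too common).
    Both thresholds are illustrative assumptions."""
    df = document_frequency(docs)
    upper = max_fraction * len(docs)
    return {term for term, n in df.items() if min_docs <= n <= upper}

corpus = [
    "the jazz band played",
    "the jazz singer sang",
    "the band toured",
    "the crowd cheered",
]
kept = filter_terms(corpus)
```

Here "the" appears in every document and is dropped as too common, while words that appear in a single document are dropped as too rare.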
Why Text Is Important
Medical records, consumer complaint logs, product inquiries, etc.
Why Text Is Difficult
“unstructured” data
Text does not have the sort of structure we normally expect of data: tables of records with fields having fixed meanings
Context is important, much more so than with other forms of data
TFIDF
A term's weight in a document combines its Term Frequency (TF) with its Inverse Document Frequency (IDF); the product is commonly referred to as TFIDF.
TFIDF(t, d) = TF(t, d) × IDF(t)
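A small sketch of the product above. The notes do not spell out TF or IDF, so this assumes TF is the raw count of the term in the document and IDF(t) = 1 + log(N / n_t), where N is the number of documents and n_t the number containing t; other variants exist, and returning 0 for an unseen term is also an assumption.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Raw count of the term in one document (assumed TF definition)."""
    return Counter(doc_tokens)[term]

def idf(term, corpus_tokens):
    """Assumed IDF: 1 + log(N / n_t). Rarer terms get a larger value."""
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    if n_containing == 0:
        return 0.0  # assumption: a term absent from the corpus gets no weight
    return 1 + math.log(len(corpus_tokens) / n_containing)

def tfidf(term, doc_tokens, corpus_tokens):
    """TFIDF(t, d) = TF(t, d) * IDF(t)."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

docs = [
    "the jazz band played".split(),
    "the jazz singer sang".split(),
    "the band toured".split(),
    "the crowd cheered".split(),
]
score_the = tfidf("the", docs[0], docs)    # term found in every document
score_jazz = tfidf("jazz", docs[0], docs)  # term found in half the documents
```

With these definitions a term that occurs in every document keeps only its raw count, while a term confined to fewer documents is weighted up.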
Named Entity Extraction
We want to be able to recognize common named entities in documents. Their component words mean one thing, and may not be significant, but in sequence they name unique entities with interesting identities.
Unlike bag of words and n-grams, which are based on segmenting text on whitespace and punctuation, named entity extractors are knowledge intensive.
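To make the contrast concrete, here is a rough sketch of the segmentation-based side only: bag of words and n-grams built by splitting on whitespace and punctuation. The regular expression and the example sentence are illustrative assumptions; no named entity extractor is shown, since that would require trained models or curated knowledge.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """All runs of n adjacent tokens, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The Los Angeles Dodgers beat the Giants."
tokens = tokenize(text)
bag = Counter(tokens)         # bag of words: word order is discarded
bigrams = ngrams(tokens, 2)   # adjacent pairs
trigrams = ngrams(tokens, 3)  # adjacent triples
```

The trigram "los angeles dodgers" does appear, but only as one of many adjacent word sequences; nothing in this procedure knows that it names a team, which is the knowledge a named entity extractor has to supply.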