Provost - Chapter 10
Why is Text Difficult?
Unstructured data - text does not have the sort of structure that we normally expect for data (tables of records with fields).
Word order sometimes matters and sometimes does not.
People misspell words, write ungrammatically, run words together, and abbreviate & punctuate unpredictably.
TFIDF
TFIDF(t, d) = TF(t, d) * IDF(t)
A TFIDF value is specific to a single document (d), whereas IDF depends on the entire corpus.
Each document becomes a feature vector, & the corpus is the set of these feature vectors.
TFIDF is very common, but not necessarily optimal.
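The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the toy corpus are made up for the example, TF is taken as a raw count, and IDF is assumed to be 1 + log(N / number of documents containing t), which is one common definition among several.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: raw count of the term in one document (assumed definition)."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """Inverse document frequency, assumed here to be 1 + log(N / n_t)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(corpus) / n_containing)

def tfidf(term, doc_tokens, corpus):
    """TFIDF(t, d) = TF(t, d) * IDF(t)."""
    return tf(term, doc_tokens) * idf(term, corpus)

def tfidf_vector(doc_tokens, corpus, vocabulary):
    """One feature vector per document, with one entry per vocabulary term."""
    return [tfidf(term, doc_tokens, corpus) for term in vocabulary]

# Toy corpus (hypothetical): each document is already tokenized.
corpus = [
    ["jazz", "music", "jazz"],
    ["music", "theory"],
    ["jazz", "saxophone"],
]
vocabulary = sorted({term for doc in corpus for term in doc})
vectors = [tfidf_vector(doc, corpus, vocabulary) for doc in corpus]
```

Here `vectors` is the corpus as a set of feature vectors. A term that appears in few documents (such as "saxophone") gets a higher IDF than one that appears in many (such as "jazz").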
Term Frequency
Words are stemmed (announced, announces & announcing all become announc).
Stopwords are removed (and, the, of, on, etc.).
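As a rough sketch of these preprocessing steps, the Python below counts term frequencies after lowercasing, removing stopwords and stemming. The stopword set is just the handful of words listed above, and `toy_stem` is a deliberately crude suffix stripper invented for this example; a real system would use a proper stemmer such as Porter's.

```python
from collections import Counter

# Tiny stopword list taken from the examples in these notes; real lists are much longer.
STOPWORDS = {"and", "the", "of", "on"}

# Hypothetical toy stemmer: strips a few common suffixes, longest first.
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(text):
    """Lowercase, strip punctuation, drop stopwords, stem, then count."""
    tokens = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    return Counter(toy_stem(t) for t in tokens)

counts = term_frequencies(
    "The company announced earnings, and the board announces dividends on Monday."
)
```

In this example "announced" and "announces" are both counted under the stem "announc", and the stopwords do not appear in `counts` at all.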
Topic Models
Each document constitutes a sequence of words, & the words map to one or more topics.
Advantage - in a search engine, a query can use terms that do not exactly match the specific words of a document; if they map to the correct topics, the document will still be considered.
Methods for creating topic models include matrix factorization methods, such as Latent Semantic Indexing, and probabilistic topic models, such as Latent Dirichlet Allocation.
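The search advantage can be illustrated with a small Python sketch. Note the big assumption: the word-to-topic mapping below is written by hand purely for illustration, whereas a real topic model (LSI, LDA, etc.) would learn it from the corpus. All names here are hypothetical.

```python
# Hypothetical, hand-written word-to-topic mapping; a real topic model
# would learn something like this from a corpus rather than have it coded.
WORD_TOPICS = {
    "car": {"vehicles"},
    "automobile": {"vehicles"},
    "engine": {"vehicles"},
    "guitar": {"music"},
    "jazz": {"music"},
}

def topics_of(tokens):
    """Collect every topic that any token maps to."""
    topics = set()
    for token in tokens:
        topics |= WORD_TOPICS.get(token, set())
    return topics

def matches_by_word(query_tokens, doc_tokens):
    """Exact matching: the query and document must share a word."""
    return bool(set(query_tokens) & set(doc_tokens))

def matches_by_topic(query_tokens, doc_tokens):
    """Topic matching: the query and document must share a topic."""
    return bool(topics_of(query_tokens) & topics_of(doc_tokens))

doc = ["the", "car", "has", "a", "new", "engine"]
query = ["automobile"]
```

With these inputs, `matches_by_word(query, doc)` is false because "automobile" never appears in the document, while `matches_by_topic(query, doc)` is true because both map to the same topic.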
Why is text important?
Medical records, consumer complaint logs, inquiries, repair records, etc. are all in text format.
Much of the content on Facebook, Twitter, Reddit, etc. is text.
Bag Of Words
It ignores grammar, word order, sentence structure & punctuation.
Each word is a token, & each document is represented by a 1 (if the token is present in the document) or a 0 (if it is not) for each token.
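A minimal Python sketch of this binary representation follows. The function names and the two toy documents are invented for the example, and the documents are assumed to be tokenized already.

```python
def build_vocabulary(documents):
    """Sorted list of every distinct token across all documents."""
    return sorted({token for doc in documents for token in doc})

def bag_of_words(doc_tokens, vocabulary):
    """Binary vector: 1 if the vocabulary token is present in the document, else 0."""
    present = set(doc_tokens)
    return [1 if token in present else 0 for token in vocabulary]

# Toy documents (hypothetical), already split into tokens.
documents = [
    ["text", "is", "hard"],
    ["hard", "data"],
]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]
```

Because only presence is recorded, reordering a document's words (for example `["hard", "is", "text"]`) gives exactly the same vector, which is what "ignores word order" means in practice.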
Named Entity Extraction
The component words of a phrase each mean one thing (& might not be significant on their own), but in sequence they name a unique entity.
Named entity extractors have to be trained on a large corpus, or hand-coded with extensive knowledge of such names.
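The hand-coded approach can be sketched in Python as a lookup against a list of known names. This is only an illustration under stated assumptions: the entity list and function name are made up, matching is case-insensitive, and the longest known name starting at each position wins. Trained extractors work quite differently.

```python
# Hypothetical hand-coded list of known entity names, stored as lowercase word tuples.
KNOWN_ENTITIES = {
    ("new", "york", "mets"),
    ("new", "york"),
    ("general", "electric"),
}
MAX_LEN = max(len(entity) for entity in KNOWN_ENTITIES)

def extract_entities(tokens):
    """Scan left to right, preferring the longest known name at each position."""
    lowered = [t.lower() for t in tokens]
    found = []
    i = 0
    while i < len(lowered):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(lowered) - i), 0, -1):
            if tuple(lowered[i:i + n]) in KNOWN_ENTITIES:
                found.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            i += 1  # no known name starts here
    return found

tokens = "The New York Mets play in New York".split()
entities = extract_entities(tokens)
```

The words "New", "York" and "Mets" carry little meaning individually, but the sequence is recognized as a single entity, and it is kept distinct from the later, shorter name "New York".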