Provost Chapter 10 (Text is important (its everywhere, need to understand…
Provost Chapter 10
Representation
taking a set of documents, each of which is a relatively free-form sequence of words, and turning it into our familiar feature-vector form
Named entity extraction
sometimes we want still more sophistication in phrase extraction: we want to be able to recognize common named entities in documents
named entity extractors have to be trained on a large corpus, or hand-coded with extensive knowledge of such names
Fundamental concepts
the importance of constructing mining-friendly data representations; representation of text for data mining
Exemplary techniques
bag-of-words representation; TFIDF calculation; N-grams; stemming; named entity extraction; topic models
Bag of words
as the name implies, the approach is to treat every document as just a collection of individual words
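A minimal sketch of the idea, not taken from the book: each document becomes a vector of word counts over a shared vocabulary, which is the feature-vector form mentioned under Representation. The helper names (`tokenize`, `bag_of_words`), the toy documents, and the very simple tokenizer (lowercasing, whitespace splitting, trimming basic punctuation) are all assumptions made for illustration.

```python
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    # Assumed, deliberately simple tokenizer: split on whitespace,
    # trim surrounding punctuation, lowercase, drop empty tokens.
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]

def bag_of_words(documents):
    # Count words per document, ignoring word order and grammar.
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is every distinct word seen in any document.
    vocabulary = sorted(set().union(*counts))
    # One row per document, one column per vocabulary word.
    vectors = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, vectors

docs = ["The cat sat on the mat.", "The dog chased the cat."]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for row in vectors:
    print(row)
```

Each row has one column per vocabulary word, so documents of different lengths end up as vectors of the same length. Word order is discarded, which is exactly the simplification the name implies.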