Chapter 10: Representing and Mining Text
Word order carries information, so sometimes we want to preserve some of it in the representation.
N-grams do this: in addition to individual words, we include sequences of adjacent words from the document.
N-grams are easy to generate and require no linguistic knowledge or complex parsing algorithms.
Bi-grams are adjacent pairs, and tri-grams are adjacent triplets.
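The notes above can be sketched in a few lines. This is a minimal illustration, not from the chapter; the function name `ngrams` is my own:

```python
def ngrams(tokens, n):
    """Return all n-grams: tuples of n adjacent tokens, left to right."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "machine learning on text data".split()
bigrams = ngrams(tokens, 2)   # adjacent pairs
trigrams = ngrams(tokens, 3)  # adjacent triplets
```

Note there is no parsing or linguistic knowledge involved: a simple sliding window over the tokens is enough.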
TFIDF is a combination of Term Frequency and Inverse Document Frequency: TFIDF(t, d) = TF(t, d) * IDF(t).
While TFIDF is common, it is not always optimal.
TF is computed within a single document (d), whereas IDF depends on the entire corpus.
Term counts within a document give the TF values; document counts across the corpus give the IDF values.
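A small sketch of this computation, assuming one common IDF variant, IDF(t) = 1 + log(|corpus| / number of documents containing t); other variants exist and the chapter's exact formula may differ:

```python
import math

def tfidf(corpus):
    """Compute TFIDF(t, d) = TF(t, d) * IDF(t) for every term of every document.

    TF(t, d): raw count of term t in document d (term counts -> TF values).
    IDF(t) = 1 + log(n_docs / docs_containing_t) (document counts -> IDF values).
    """
    n_docs = len(corpus)
    doc_count = {}  # in how many documents each term appears
    for doc in corpus:
        for term in set(doc):
            doc_count[term] = doc_count.get(term, 0) + 1
    scores = []
    for doc in corpus:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        scores.append({t: c * (1 + math.log(n_docs / doc_count[t]))
                       for t, c in tf.items()})
    return scores

corpus = [["data", "science", "data"], ["science", "rocks"]]
scores = tfidf(corpus)
```

A term appearing in every document ("science" here) gets the minimum IDF weight of 1, while rarer terms are weighted up.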
The importance and difficulty of text
'New media' has moved to the internet
email, Twitter, Facebook, blog posts, product descriptions
Customer responses usually come in the form of text, and can now also be seen through various rating systems
Text can have various lengths, and the spelling of responses varies
misspelled words, abbreviations, poor grammar
Context can be hard to understand in text form.
Bag of words
Treats every document as a collection of individual words. This approach ignores grammar, word order, sentence structure, and punctuation.
Systems that use this approach typically apply stemming and stopword elimination before counting terms.
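A minimal sketch of that preprocessing step. The stopword list and the suffix-stripping stemmer here are deliberately naive illustrations of my own; real systems use full stopword lists and a proper stemmer such as Porter's:

```python
STOPWORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}  # tiny illustrative list

def crude_stem(word):
    """Strip a few common suffixes (illustration only, not a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:len(word) - len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stopwords, then stem the remaining tokens."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

result = preprocess("The dog is jumping and the cat jumps")
```

After this step, "jumping" and "jumps" count as the same term, and stopwords no longer inflate the term counts.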
In the simplest (binary) representation, each token is marked 1 if it appears in the document and 0 if it does not; this reduces the document to its set of distinct words.
For example, a document containing 50 distinct words is represented by just that set of 50 words.
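The binary representation can be sketched as follows; the vocabulary and function name are my own illustration:

```python
def binary_bag(document_tokens, vocabulary):
    """Represent a document as 1/0 per vocabulary term: present or absent."""
    present = set(document_tokens)
    return {term: int(term in present) for term in vocabulary}

vocab = ["data", "text", "mining", "golf"]
doc = "text mining turns text into data".split()
vec = binary_bag(doc, vocab)
```

Note that "text" appears twice in the document but still maps to 1: only presence matters, so the document is reduced to its set of words.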
A topic layer is an additional layer inserted between the document and the model.
The main purpose of the topic layer is to separate out the topics of the corpus.
The words in the corpus are then mapped to the topics; each word can fit one or more topics.
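A toy illustration of the word-to-topic mapping; the topics and word sets here are hand-built assumptions for the sketch (a real topic model, such as LDA, would learn this mapping from the corpus):

```python
# Hypothetical hand-built word-to-topic mapping.
TOPIC_WORDS = {
    "finance": {"stock", "market", "price", "bank"},
    "sports": {"game", "score", "team", "price"},  # "price" fits more than one topic
}

def topic_scores(tokens):
    """Score a document against each topic by counting topic-word matches."""
    return {topic: sum(t in words for t in tokens)
            for topic, words in TOPIC_WORDS.items()}

doc_topics = topic_scores("the stock price and the market".split())
```

The model then works with these topic scores rather than with the raw words, which is the point of inserting the layer.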
A search engine like Google can take a string of words and search the internet for documents that contain that string.
The general strategy is to use the simplest (least expensive) technique that works.
A document is one piece of text, no matter how small or large; it could be a single sentence or 100 pages.
A collection of documents is called a corpus.