Ch. 10 Representing and Mining Text (Types of approach in Text Mining (Bag…

- - - - Text does not have the sort of structure that we normally expect from data
    - - Synonyms and homographs
    - - Transform body of text into data
- - - - Ignores grammar, word order, sentence structure, and punctuation
    - - Use the word count
      - Importance of term increases with number of times used
    - - Two opposing considerations against term frequency
        
        Term should not be too rare
        
        Term should not be too common
      - Impose upper and lower limits on term frequency to address both considerations
      - IDF(t) = 1 + log(total # of documents / # of documents containing t)
      - Relationship to entropy
        
        Binary term: p2 = 1-p1
        
        entropy = - p1 log(p1) - p2 log(p2)
        
        Express entropy as expected value of IDF(t)
    - - TFIDF(t,d)= TF(t, d) * IDF(t)
      - Specific to single document
        
        Each document becomes feature vector
      - Not necessarily optimal
      - Jazz musician example
  - - - bi-grams
- - - - Stream of news stories
      - Corresponding stream of daily stock prices
  - - - Gets rid of stories that are summaries
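
The bag-of-words branch above (word counts, IDF, TFIDF, and the entropy link) can be made concrete with a small sketch. This is only an illustration under stated assumptions: the three-document corpus is made up, tokenization is a plain lowercase whitespace split, and all function names are hypothetical. The entropy check assumes the constant 1 in IDF is dropped and that p1 is read as the fraction of documents containing the term.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase, split on whitespace."""
    return text.lower().split()


def term_frequencies(tokens):
    """TF(t, d): raw count of each term in one document; word order is ignored."""
    return Counter(tokens)


def inverse_document_frequency(term, documents):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(documents) / containing)


def tfidf_vector(tokens, documents):
    """TFIDF(t, d) = TF(t, d) * IDF(t); one feature vector per document."""
    counts = term_frequencies(tokens)
    return {
        term: count * inverse_document_frequency(term, documents)
        for term, count in counts.items()
    }


def binary_entropy(p1):
    """entropy = -p1 log(p1) - p2 log(p2), with p2 = 1 - p1."""
    p2 = 1 - p1
    return -p1 * math.log(p1) - p2 * math.log(p2)


def expected_idf_without_offset(p1):
    """Expected value of log(1/p) over 'term present' and 'term absent'.

    Assumption: the '+1' in IDF is dropped, so IDF(t) = log(1/p1)
    and IDF(not t) = log(1/p2).
    """
    p2 = 1 - p1
    return p1 * math.log(1 / p1) + p2 * math.log(1 / p2)


# Made-up toy corpus, for illustration only.
corpus = ["jazz jazz saxophone", "jazz piano", "stock news"]
documents = [tokenize(text) for text in corpus]

# Each document becomes a feature vector of TFIDF scores.
vectors = [tfidf_vector(doc, documents) for doc in documents]
for text, vector in zip(corpus, vectors):
    print(text, "->", vector)
```

In the first toy document, "jazz" appears twice but also occurs in two of the three documents, while "saxophone" appears once in a single document, so its higher IDF partly offsets its lower count. The upper and lower frequency limits mentioned in the outline are not applied here.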