Chapter 10: Representing and Mining Text
Why Text is Important
everywhere
medical records
consumer complaints
product inquiries
to understand people
we must read what they write
Why Text is Difficult
Unstructured data
has linguistic structure
but that structure is meant for people, not for computers
determining context
requires a lot of pre-processing
Representation
document is
a piece of text
composed of tokens/terms
a word
collection of documents
corpus
Bag of Words
treats all docs as
lots of words
straightforward and inexpensive
reduces a doc to its words
Term Frequency
how often a word appears
produces a table
for word counts
Step 1
case is normalized
no longer case sensitive
Step 2
stemming
suffixes are removed
Step 3
stopwords removed
e.g. the, and, of, on
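The three steps above can be sketched in Python. This is a minimal illustration: the stopword list and suffix-stripping rules here are simplified stand-ins for what a real pipeline would use (e.g. Porter's stemmer and a full stopword list).

```python
from collections import Counter

STOPWORDS = {"the", "and", "of", "on", "a", "in", "to"}  # tiny illustrative list

def simple_stem(word):
    # crude suffix stripping as a stand-in for a real stemmer (e.g. Porter's)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(document):
    tokens = document.lower().split()                    # Step 1: normalize case
    tokens = [simple_stem(t) for t in tokens]            # Step 2: remove suffixes
    tokens = [t for t in tokens if t not in STOPWORDS]   # Step 3: drop stopwords
    return Counter(tokens)                               # the table of word counts

print(term_frequencies("The cats and the dog played on the lawn"))
```

The result is the term-frequency table: each surviving stemmed word mapped to its count in the document.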
Measuring Sparseness
a word
shouldn't be too rare
eliminated if it appears in too few documents
shouldn't be too common
overly common words are eliminated too
measured by
inverse doc frequency
can be thought of as
boost a term gets for being rare
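One common formulation of that boost (the "1 + log" variant; other variants exist) is:

```latex
\mathrm{IDF}(t) = 1 + \log\!\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)
```

The fewer documents a term appears in, the larger the fraction inside the log, so rarer terms get a bigger weight.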
Combining Them: TFIDF
product of
term frequency (TF)
Inverse doc frequency (IDF)
term = t
document = d
doing this turns docs into feature vectors
that can go into data mining algorithms
though the representation is not necessarily optimal
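A minimal sketch of computing TFIDF over a toy corpus. It assumes raw counts for TF and the "1 + log" form of IDF; real feature-generation pipelines vary in both choices.

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: list of documents, each a list of tokens.
    Returns one {term: TFIDF score} dict per document."""
    n_docs = len(corpus)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)  # term frequency: raw count within this document
        scores.append({
            t: count * (1 + math.log(n_docs / df[t]))  # TFIDF(t, d) = TF(t, d) * IDF(t)
            for t, count in tf.items()
        })
    return scores

docs = [["jazz", "guitar", "jazz"], ["guitar", "amp"], ["jazz", "club"]]
for s in tfidf(docs):
    print(s)
```

A term that is frequent in one document but spread across few documents (like "jazz" in the first doc) scores higher than one that is common across the corpus.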
The Relationship of IDF to Entropy
entropy ≠ IDF
however they are related
when combined
they form an expected-value equation
Beyond Bag of Words
N-gram Sequences
word order is important
sequences of adjacent words
phrases are significant
even if
individual words aren't
require no linguistic knowledge
disadvantage
greatly increase size of feature set
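Adjacent-word sequences can be generated mechanically, which is part of their appeal (a sketch; n = 2 yields the bigrams of a tokenized document):

```python
def ngrams(tokens, n):
    # slide a window of length n across the token sequence
    return ["_".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["bag", "of", "words"], 2))  # → ['bag_of', 'of_words']
```

Every position contributes one n-gram, so the feature set grows quickly as n increases.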
Named Entity Extraction
sophistication in phrase extraction
ex: Silicon Valley
knows when word sequences
constitute proper names
knowledge intensive
lots of training
or hand coding
Topic Models
topic layer
a layer between the documents and the model
model topics separately
advantage
can use terms that
don't directly match
model can stay relevant
even if it isn't perfect
matrix factorization methods
look to cluster words
in theory (math is hard)