Data Science for Business
By: Fawcett & Provost
"CH 10: Representing and Mining Text"
Representation
Transforming a body of text into a set of data
Terminology:
Document: one piece of text
Tokens/terms: the various components of a document
Corpus: a collection of documents
Bag of Words Approach
Treats every document as a collection of individual words
Ignores grammar, word order, sentence structure, & punctuation
Term Frequency:
Produces a table of word counts (frequencies)
Considerations
Case has been normalized
Words have been stemmed
Stopwords have been removed
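The preprocessing steps above (case normalization, stopword removal, stemming) followed by term counting can be sketched in Python. The stopword list and the crude suffix-stripping `stem` below are toy stand-ins for real tools (e.g. a Porter stemmer), not the book's code:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer
STOPWORDS = {"the", "a", "of", "and", "to", "in", "it"}

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(document):
    # Normalize case, keep alphabetic tokens, drop stopwords, then stem
    tokens = re.findall(r"[a-z]+", document.lower())
    terms = [stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(terms)

tf = term_frequencies("The musicians were playing jazz and played it well.")
```

Here both "playing" and "played" collapse to the term "play", so its frequency is 2.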
Measuring Sparseness
Term should not be too rare
Term should not be too common
Inverse Document Frequency (IDF)
Equation: IDF(t) = 1 + log(total # of documents / # of documents containing t)
Relationship to Entropy (via IDF(t) and IDF(not_t))
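The IDF equation translates directly into code. The four-document corpus below is illustrative only, and the natural log is used here since the chapter's formula does not pin down a base:

```python
import math

def idf(term, corpus):
    # corpus: a list of documents, each represented as a set of terms
    containing = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / containing)

corpus = [{"jazz", "guitar"}, {"jazz", "sax"}, {"opera", "voice"}, {"jazz"}]
# "jazz" occurs in 3 of 4 documents (too common -> low IDF);
# "opera" occurs in only 1 of 4 (rare -> high IDF)
```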
TFIDF
Product of Term Frequency (TF) and IDF
Value is specific to a single document
Each document becomes a feature vector
Common representation, but may not be optimal
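Turning each document into a TFIDF feature vector can be sketched as follows; the three short documents and whitespace tokenization are assumptions for illustration, not the book's setup:

```python
import math
from collections import Counter

docs = [
    "charlie parker played bebop sax",
    "miles davis played cool jazz trumpet",
    "john coltrane played modal jazz sax",
]
tokenized = [doc.split() for doc in docs]
# Fixed vocabulary: one feature position per distinct term
vocab = sorted({t for doc in tokenized for t in doc})

def idf(term):
    containing = sum(1 for doc in tokenized if term in doc)
    return 1 + math.log(len(tokenized) / containing)

def tfidf_vector(doc):
    # TFIDF(t, d) = TF(t, d) * IDF(t); terms absent from the document get 0
    tf = Counter(doc)
    return [tf[t] * idf(t) for t in vocab]

vectors = [tfidf_vector(doc) for doc in tokenized]
```

"played" appears in every document, so its IDF is 1 and it is down-weighted relative to a rare term like "bebop".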
Example: Jazz Musicians
Apply basic stemming
Cosine Similarity Function
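Cosine similarity compares two feature vectors by the angle between them: the dot product divided by the product of their norms. A minimal sketch:

```python
import math

def cosine_similarity(x, y):
    # Dot product of the two vectors over the product of their norms
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

cosine_similarity([1, 2, 0], [1, 2, 0])  # identical direction -> 1.0
cosine_similarity([1, 0], [0, 1])        # orthogonal -> 0.0
```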
Fundamental Concepts
Importance of constructing mining-friendly data representations
Representation of text for data mining
Text
Why is it important?
Text is everywhere (e.g. medical records, web pages, Twitter feeds)
Business: essential for understanding customer feedback
Why is it difficult?
"Unstructured data"
Linguistic structure: made for human consumption
Context is crucial
Relatively dirty
Subject to human behavior and error
Beyond Bag of Words
N-gram Sequences
Include sequences of adjacent words as terms
Useful when phrases are significant, but their component words are not
Easy to generate
Requires no linguistic knowledge
No complex parsing algorithm
Main disadvantage: greatly increases size of the feature set
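The "easy to generate" point can be seen in a few lines: slide a window of length n over the token list. Joining components with an underscore is one common convention, assumed here rather than prescribed by the book:

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list, joining with "_"
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "exceed analyst expectations".split()
ngrams(tokens, 2)  # ['exceed_analyst', 'analyst_expectations']
```

With a vocabulary of size V there are up to V^2 possible bigrams, which is why the feature set grows so quickly.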
Named Entity Extraction
Knowledge intensive
Requires a large corpus
Knowledge has to be learned, or coded by hand
Topic Models
Modeling documents w/ a topic layer
Terms and term weights are learned by the modeling process
A type of latent information model
Example: Mining News Stories to Predict Stock Price Movement
The Task
Predict the effect on stock price for the same day
Seek out direction of stock price movement (i.e. up, down, no change)
Predict relatively large changes
Make assumptions to narrow the "causal radius"
The Data
Two separate time series
Stream of news stories
Stream of daily stock prices
Crucial to eliminate some of the noise
Data Preprocessing
Measure with appropriate variables to find daily percent change
Align stories with the correct day & trading window
Reject stories mentioning more than two stocks
Each story is tagged with a label (change or no change)
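A minimal sketch of the percent-change and labeling steps above, assuming a simple threshold on the daily move; the 5% cutoff and the label names are illustrative choices, not the study's actual parameters:

```python
def percent_change(open_price, close_price):
    # Daily percent change relative to the opening price
    return 100.0 * (close_price - open_price) / open_price

def label_story(change, threshold=5.0):
    # threshold (in percent) is an illustrative cutoff for a
    # "relatively large" move; small moves count as "no change"
    if abs(change) < threshold:
        return "no change"
    return "change-up" if change > 0 else "change-down"

label_story(percent_change(40.0, 43.0))  # a 7.5% rise -> "change-up"
```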
Results
Bag of words representation is primitive for this task
Named entity recognition could be improved
More instantaneous price changes due to quick market responses