Chapter 10: Representing and Mining Text
Text Is Important
Very common
from sources like
Applications
Medical Records
Product Inquiries
Consumer Complaint Logs
Repair Records
from Internet
Google, Bing, etc.
Reddit, Facebook, Twitter, etc.
Growing internet
means growing data
Very Difficult
unstructured
for computers
often has linguistic structure
Easy for humans
Dirty Data
Representation
Stems from Information Retrieval field
Types
Corpus
a collection of
Documents
Made up of
tokens
terms or words
No matter the size
One piece of text
Difficulties
Unstructured
"Bag of Words"
Steps
Find Frequency
using word count
to calculate "Term Frequency"
shows
differences in usage
among all tokens in a document
want to know
which terms are
rare
common
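The word-count step above can be sketched as follows (a minimal sketch assuming whitespace tokenization; the function name is illustrative):

```python
from collections import Counter

def term_frequency(document):
    """Count how often each token appears in one document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # Normalize raw counts by document length so long documents
    # do not dominate purely by having more words.
    return {term: count / total for term, count in counts.items()}

tf = term_frequency("the quick brown fox jumps over the lazy dog")
# tf["the"] == 2/9 (two of nine tokens)
```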
Inverse Document Frequency
shows usage of terms
to identify
rare terms
common terms
Decreases as
words are more common
Increases as
words are more rare
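One common way to compute IDF, sketched below; the exact weighting scheme varies by library, so treat this form as one illustrative choice:

```python
import math

def inverse_document_frequency(term, corpus):
    """IDF: large for rare terms, small for common ones."""
    docs_with_term = sum(
        1 for doc in corpus if term in doc.lower().split())
    if docs_with_term == 0:
        return 0.0  # term never appears; no useful weight
    # 1 + log(N / df): decreases toward 1 as the term appears
    # in more documents, increases as it becomes rarer.
    return 1 + math.log(len(corpus) / docs_with_term)
```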
TFIDF
combines
Term Frequency
Inverse Document Frequency
converts
documents into
feature vectors
corpus into
set of feature vectors
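Combining the two scores turns each document into a feature vector over the corpus vocabulary. A sketch, again assuming simple whitespace tokenization and the 1 + log weighting above:

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Convert a corpus into a set of TFIDF feature vectors."""
    tokenized = [doc.lower().split() for doc in corpus]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # TF (normalized count) times IDF, one entry per vocab term.
        vec = [(counts[t] / len(doc)) * (1 + math.log(n_docs / df[t]))
               for t in vocab]
        vectors.append(vec)
    return vocab, vectors
```

Each document becomes one row; terms the document lacks get a weight of zero.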
Beyond Bag of Words
bag of words
is simple because it
does not account for
linguistic context
For complex documents
N-Gram Sequences
includes
word sequencing
Ex:
"The quick brown fox"
"quick brown", "brown fox", etc.
Named Entity Extraction
recognizes
named entities
in a document
difficult because
extraction often struggles with
understanding context
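Real named-entity extractors need context. A crude capitalization heuristic, purely illustrative, and exactly the kind of approach that struggles with ambiguity:

```python
def naive_entities(text):
    """Flag capitalized tokens that are not sentence-initial.

    Crude heuristic only: it misses lowercase entities and
    mislabels any mid-sentence capitalized word.
    """
    tokens = text.split()
    return [t.strip(".,") for i, t in enumerate(tokens)
            if i > 0 and t[:1].isupper()]

naive_entities("Yesterday Alice flew to Paris with Bob.")
# ["Alice", "Paris", "Bob"]
```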
Topic Models
includes
Topic Layer
models the set of topics
of documents in the corpus
as a separate layer
helpful for
search engines
because even if keywords don't match
results can still come from a relevant topic