Ch. 10 Representing and Mining Text (Text Data (Representation…

- - - - user-generated content
        
        be able to understand it
  - - - linguistic structure
        
        not intended for computers
        
        understand context
        
        preprocessing before use as input
  - - - Document
        
        Tokens or Terms
        
        Collection of documents
        
        Corpus
    - - create feature-vector form that is familiar
        
        Bag of words
        
        Treat every word in doc as individual
        
        Creates a set of words
        
        Term Frequency
        
        word count
        
        the more a word is used, the higher its importance
        
        Normalize data
        
        words are all same case
        
        stemmed
        
        stopwords removed
        
        Measure sparseness
        
        term shouldn't be too rare
        
        Impose a limit
        
        Term shouldn't be too common
        
        Impose a limit
        
        Inverse document frequency
        
        measures sparseness
        
        uses entire corpus
        
        Combine Term Freq. and Inverse Doc. Freq.
        
        Can use on single document
        
- - - - particular phrase is significant
  - - - a clustering of words
    - - Learned through the process
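
The bag-of-words, term frequency, sparseness-limit, and inverse document frequency nodes above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's own code: the stopword list, the helper names, the `min_docs` and `max_fraction` limits, and the three-document corpus are all hypothetical. The IDF form `1 + log(N / df)` is one common formulation, and stemming is left out to keep the sketch short.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list for illustration; real lists are much longer.
STOPWORDS = {"the", "a", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Normalize case, split into word tokens, and remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def term_frequency(tokens):
    """Bag of words for one document: each term mapped to its raw count."""
    return Counter(tokens)


def document_frequency(corpus_tokens):
    """Number of documents in the corpus that contain each term."""
    counts = Counter()
    for tokens in corpus_tokens:
        counts.update(set(tokens))  # count each term once per document
    return counts


def filter_vocabulary(corpus_tokens, min_docs=1, max_fraction=1.0):
    """Keep terms that are neither too rare nor too common (assumed limits)."""
    n_docs = len(corpus_tokens)
    df = document_frequency(corpus_tokens)
    return {t for t, d in df.items()
            if d >= min_docs and d / n_docs <= max_fraction}


def inverse_document_frequency(corpus_tokens):
    """IDF(t) = 1 + log(N / df(t)); rarer terms get larger values."""
    n_docs = len(corpus_tokens)
    df = document_frequency(corpus_tokens)
    return {t: 1 + math.log(n_docs / d) for t, d in df.items()}


def tfidf(tokens, idf):
    """Combine term frequency (one document) with IDF (whole corpus)."""
    return {t: count * idf[t] for t, count in term_frequency(tokens).items()}


# Hypothetical toy corpus.
corpus = [
    "The cat sat in the hat",
    "The dog chased the cat",
    "Jazz is a style of music",
]
corpus_tokens = [tokenize(doc) for doc in corpus]
idf = inverse_document_frequency(corpus_tokens)
scores = tfidf(corpus_tokens[0], idf)
```

In this toy corpus "cat" appears in two of the three documents and "sat" in only one, so "sat" receives the higher TFIDF score in the first document even though both words occur there once. Term frequency is computed per document, while IDF needs the entire corpus.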