ANL312 Text Mining and Applications - Coggle Diagram

- - - - Primary Literature * (PREFERRED)
      - Secondary Literature
      - People (Experts in field)
      - Organisations (Get Data)
    - - Online
        
        General search engines (e.g. Google, Yahoo, and Bing)
        
        Academic search engines (e.g. Google Scholar and CiteSeerX)
        
        FAQs (Frequently Asked Questions) and forums
      - Offline
        
        Library (e.g. books and archives)
      - Strategy
- - - - Text
      - Corpus
      - Natural Language Processing
      - Parsing
        This is the thing I wanted to learn!!!
      - IN SPSS MODELER
        
        1. Tokens (lowest)
        PHASE 1: Reading Text Files
        
        Collections of characters or abbreviations separated by spaces
        
        Terms
        PHASE 2: TEXT PARSING
        
        Parts of speech (POS) Tagging
        
        Stemming
        
        Synonyms
        
        Exclude List
        
        Tokens after NLP has been applied to perform part-of-speech tagging
        
        Categories (TOP)
        PHASE 3: Categorising text
        
        Similar concepts
        
        Concepts
        PHASE 2: TEXT PARSING (cont.)
        
        Concept and Type Extraction
        
        Single or multiword terms that remain after non-informative terms are filtered out via POS tags
  - - - Determining the unit of text analysis
        
        Sentence Level
        
        Paragraph Level
        
        Document Level
      - Tokenisation
        
        The process of breaking units of text, such as sentences or paragraphs, into individual tokens (words).
      - Removing stop words
        
        Not always needed
        If analysing sentence structure, the stop words may need to be kept
        
        Removes common words (a, is, on, the, etc.)
      - Stemming
        
        Maps all variants of a word to its root word
        
        Example: Promote
        (promoting, promotion, promotes)
      - Spelling normalisation
        
        Corrects misspelled words
      - Case normalisation
        
        Converts the entire document to upper or lower case, so that "A" and "a" become the same word.
      - Part of speech tagging
        
        Available in SPSS Modeler
        
        Important to filter out non-informative tags
        
        Example: DT (Determiner: an, any, the, this)
      - Notes on sequencing?
        Link on sequencing (Link)
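
The preprocessing steps in the outline above (unit of analysis, tokenisation, case normalisation, stop-word removal, stemming, and POS-based filtering) can be sketched in plain Python as a rough illustration. This is not how SPSS Modeler does it internally. The stop-word list, the suffix list, and every function name here are assumptions made for the example, and the stemmer is a toy suffix stripper, so it yields a truncated stem such as "promot" rather than the dictionary word "promote". The POS filter assumes the tokens have already been tagged by some tagger.

```python
import re

# Illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "is", "on", "the", "of", "and", "to"}

# Suffixes for a toy stemmer (an assumption; this is not the Porter algorithm).
SUFFIXES = ("ing", "ion", "es", "s", "e")

# POS tags treated as non-informative; DT (determiner) is the outline's example.
NON_INFORMATIVE_TAGS = {"DT"}


def split_sentences(document):
    """Sentence-level unit of analysis: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]


def tokenise(text):
    """Break a unit of text into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", text)


def normalise_case(tokens):
    """Lower-case every token so that "A" and "a" are treated alike."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Drop common words; expects tokens that are already lower-cased."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def filter_by_pos(tagged_tokens):
    """Keep words whose (word, tag) pair carries an informative tag."""
    return [word for word, tag in tagged_tokens if tag not in NON_INFORMATIVE_TAGS]


def preprocess(text):
    """One possible ordering: tokenise, normalise case, drop stop words, stem."""
    tokens = tokenise(text)
    tokens = normalise_case(tokens)
    tokens = remove_stop_words(tokens)
    return [stem(t) for t in tokens]


print(preprocess("The manager is promoting the promotion."))
# ['manager', 'promot', 'promot']

tagged = [("the", "DT"), ("report", "NN"), ("promotes", "VBZ"),
          ("this", "DT"), ("product", "NN")]
print(filter_by_pos(tagged))
# ['report', 'promotes', 'product']
```

On sequencing: in this sketch case normalisation runs before stop-word removal because the stop-word list is lower-case, so "The" would otherwise slip through. Stemming runs last so that stop words are matched in their original form.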