Text Mining - Coggle Diagram

- - - - Rule and dictionary based systems
      - Captures linguistic knowledge in rules written by experts
      - :check: Expert knowledge yields highly precise results
      - :red_cross: Shortage of experts
      - :red_cross: Laborious rule writing, dictionary preparation
      - :red_cross: Domain adaptation problematic
      - :check: Results can be interpreted
      - :check: Good when labelled data is hard to obtain
    - - Data-driven
      - Use of large amounts of (labelled) textual data to train systems, discover patterns
      - :check: Can generalise well on unseen examples
      - :red_cross: Need people for labelling
      - :red_cross: Time-consuming and laborious labelling
      - :red_cross: Must retrain for new domain
      - :red_cross: Often cannot inspect/change models
      - :check: Good where dictionaries are unavailable
  - - - Considers language structure, linguistics
      - Requires knowledge of grammar
      - Enables machines to understand, process and generate human language
    - - Not necessarily looking at grammar
      - Possibly relying only on frequencies to find correlations
      - Extracts patterns, trends, insights and knowledge from large text datasets
      - Usually seen to have more modest, engineering aims
- - - - Used for information retrieval
      - Can find similar documents as they will have similar words
      - Can find documents close to a query, representing a query as a document
  - - - Infrequent words
      - Correlated words
    - - Many word pairs that should have non-zero counts have zero counts because of data sparsity
      - Laplace smoothing: adding 1 to every entry (pseudocount)
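
The Laplace smoothing item above can be illustrated with a minimal sketch. It assumes the counts are held as a small list-of-lists matrix and that each row is then normalised into probabilities; the function name `laplace_smooth` and the toy counts are hypothetical, not taken from the diagram.

```python
def laplace_smooth(counts, pseudocount=1):
    """Add a pseudocount to every entry, then normalise each row to sum to 1."""
    smoothed = [[c + pseudocount for c in row] for row in counts]
    return [[c / sum(row) for c in row] for row in smoothed]


# Toy co-occurrence counts (made up for illustration); zeros come from sparsity.
counts = [[2, 0, 1],
          [0, 0, 3]]
probs = laplace_smooth(counts)
# The first row becomes [3/6, 1/6, 2/6]: no entry is zero after smoothing.
```

The trade-off is that probability mass is moved from observed pairs to unobserved ones, which matters more when the matrix is very sparse.
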
- - - - Each word w ∈ V is represented by a vector v ∈ R^d
      - There is a mechanism to compute the probability P(w_t | u_1, u_2, ..., u_l) of the event that a target word w_t appears in a context u_1, ..., u_l.
    - - Find a vector v for each word w such that these probabilities are as high as possible for each w and its context
  - - - Word embeddings are learned from data, which means implicit biases are captured
      - Gender bias: "homemaker" closer to woman
      - Ethnic bias: African-American names are more often associated with unpleasant words
      - Debiasing embeddings is a hot research topic
  - - - f is continuous with respect to c (contextual embeddings, e.g. ELMo, BERT)
      - f is discrete with respect to c (e.g. word sense disambiguation)
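
The diagram does not say which mechanism computes P(w_t | u_1, ..., u_l). One common choice, assumed here purely for illustration, is to average the context vectors, score every vocabulary word by dot product, and apply a softmax. The function name, the toy vectors, and the use of a single vector per word for both target and context roles are all assumptions of this sketch.

```python
import math


def context_probability(target, context, vectors):
    """Softmax probability of `target` given the averaged context vectors."""
    dim = len(next(iter(vectors.values())))
    # Average the vectors of the context words u_1, ..., u_l.
    avg = [sum(vectors[u][i] for u in context) / len(context) for i in range(dim)]
    # Score every vocabulary word by its dot product with the context average.
    scores = {w: sum(a * b for a, b in zip(v, avg)) for w, v in vectors.items()}
    # Subtract the maximum score before exponentiating, for numerical stability.
    top = max(scores.values())
    normaliser = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[target] - top) / normaliser


# Made-up 2-dimensional embeddings; real ones are learned from data.
VECTORS = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
p_cat = context_probability("cat", ["dog"], VECTORS)
p_car = context_probability("car", ["dog"], VECTORS)
```

Training would then adjust the vectors so that these probabilities are as high as possible for words and contexts observed together.
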
- - - - Delimiter-separated values (DSV)
      - JSON
  - - - Can handle annotations which are hierarchical (e.g. nested NEs, trees) and structured (e.g. events)
    - - Requires substantial processing with standard XML parsers
      - Impossible to encode overlapping/intersecting annotations
  - - - Original text is left untouched
      - Can handle structured, nested and overlapping annotations
    - - Not readily human readable
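
The last three items appear to describe stand-off annotation, where annotations refer to character offsets instead of being embedded in the text. A minimal sketch under that assumption follows; the dictionary layout, the helper names and the labels are hypothetical and do not follow any particular annotation tool's format.

```python
def annotate(text, phrase, label):
    """Build a stand-off annotation pointing at the first occurrence of `phrase`."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


def overlaps(a, b):
    """True if two annotations share at least one character."""
    return a["start"] < b["end"] and b["start"] < a["end"]


TEXT = "Bank of England governor Andrew Bailey"  # the text itself is never modified
ANNOTATIONS = [
    annotate(TEXT, "Bank of England", "ORG"),
    annotate(TEXT, "England", "LOC"),                        # nested inside ORG
    annotate(TEXT, "England governor", "EXAMPLE_OVERLAP"),   # crosses the ORG boundary
    annotate(TEXT, "Andrew Bailey", "PER"),
]

for ann in ANNOTATIONS:
    print(ann["label"], repr(TEXT[ann["start"]:ann["end"]]))
```

The crossing span is something inline XML cannot express, but the offsets alone are hard to read without the text next to them.
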
- - - - Morphosyntactic word
      - Punctuation mark or special symbol
      - A number
  - - - Sentence segmentation - splits
      - Tokenisation - splits
      - Named Entity Recognition - combines
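
As a rough illustration of tokenisation as a splitting step, the sketch below treats numbers, word-like strings, and single punctuation marks or symbols as tokens. The regular expression is an assumption made for this example and is far simpler than a production tokeniser.

```python
import re

# Numbers (with optional decimal or thousands separators), then word-like
# strings, then any single non-space symbol such as a punctuation mark.
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenise(text):
    """Split text into number, word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenise("Dr. Jones paid 3.50 pounds!")
# -> ['Dr', '.', 'Jones', 'paid', '3.50', 'pounds', '!']
```
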
- - - - Lexical sample analysis
      - All words tasks (similar to POS tagging)
  - - - Target word w
      - Context words of w: wj
      - Lexicon definition of senses: D(.)
      - Set of senses of a word: S(.)
    - - Simple to implement
      - No training data needed
    - - Not all words have definitions in WordNet
      - Need to deal with ambiguous context words
      - Poor performance
  - - - Collocation features of words and n-grams
      - Weighted average of word embeddings (of all words in a window)
      - POS tags
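
The dictionary-based items above (target word w, context words w_j, definitions D(.), sense sets S(.)) resemble the simplified Lesk algorithm. A minimal sketch under that assumption: pick the sense whose definition shares the most words with the context. The toy lexicon below is made up and stands in for a resource such as WordNet.

```python
def simplified_lesk(target, context_words, senses, definitions):
    """Return the sense of `target` whose definition overlaps most with the context."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    for sense in senses[target]:                        # S(w)
        gloss = set(definitions[sense].lower().split())  # D(sense)
        overlap = len(gloss & context)
        if overlap > best_overlap:                      # ties keep the first-listed sense
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical two-sense lexicon for illustration only.
SENSES = {"bank": ["bank_finance", "bank_river"]}
DEFINITIONS = {
    "bank_finance": "a financial institution that accepts deposits of money",
    "bank_river": "sloping land beside a river or lake",
}

sense = simplified_lesk("bank", ["she", "sat", "by", "the", "river"], SENSES, DEFINITIONS)
# -> 'bank_river'
```

No training data is needed, but the result depends entirely on lexical overlap with short definitions, which is one reason performance tends to be poor.
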
- - - - Should the next word be uppercase? Problematic for some languages, e.g. German, where all nouns are capitalised
      - Permissible chars after potential EOS, e.g. lowercase characters?
    - - Titles not likely to occur at EOS e.g. Dr. Jones
      - Company indicators could occur at EOS e.g. MySocialMedia Inc
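
The end-of-sentence (EOS) heuristics above can be combined into a small rule-based splitter. This is a sketch only: the title list is an assumed, incomplete sample, and the uppercase check inherits the weakness noted for languages such as German.

```python
import re

TITLES = {"Dr", "Mr", "Mrs", "Ms", "Prof"}  # assumed sample, not exhaustive


def split_sentences(text):
    """Split at ., ! or ? unless a title precedes it or a non-uppercase char follows."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words_before = text[start:match.start()].split()
        last_token = words_before[-1] if words_before else ""
        next_char = text[match.end():match.end() + 1]
        if last_token in TITLES:
            continue  # titles are unlikely to end a sentence, e.g. "Dr. Jones"
        if next_char and not next_char.isupper():
            continue  # a lowercase continuation suggests no sentence boundary
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


result = split_sentences("Dr. Jones joined MySocialMedia Inc. She starts on Monday.")
# -> ['Dr. Jones joined MySocialMedia Inc.', 'She starts on Monday.']
```

Company indicators such as "Inc" are deliberately not in the exclusion list, since they can legitimately appear at the end of a sentence.
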
- - - - Distributional Hypothesis: similar contexts suggest similar meanings
- - - - takes two equal length vectors and returns a single number
    - - normalise the dot product by dividing by the product of the vector lengths (magnitudes or norms) |x||y|, giving a value between -1 and 1 (between 0 and 1 for non-negative count vectors)
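
The two items above describe the dot product and its length-normalised form, commonly called cosine similarity. A minimal sketch in plain Python follows; returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors: a single number."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    return sum(a * b for a, b in zip(x, y))


def cosine_similarity(x, y):
    """Dot product divided by the product of the vector norms |x| and |y|."""
    norm_x = math.sqrt(dot(x, x))
    norm_y = math.sqrt(dot(y, y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention for this sketch: similarity with a zero vector is 0
    return dot(x, y) / (norm_x * norm_y)


same_direction = cosine_similarity([1, 2], [2, 4])   # close to 1.0
orthogonal = cosine_similarity([1, 0], [0, 1])       # exactly 0.0
```

Because of the normalisation, a long document and a short one with the same word proportions score as highly similar, which the raw dot product would not capture.
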