03. Natural Language Processing
NLP 1 - Semantics
Meaning of language
Linguistics and logic concerned with meaning
Translation, fact-checking, logic checking
From the linguist's point of view, meaning can be modeled in several ways:
using logic
using relations between words (synonyms, hyperonyms, etc.)
In NLP we use formalisms (first-order logic, predicate structures, etc.) and resources (WordNet, DBpedia, etc.) to enrich text with meaning and knowledge about the world.
Modeling reference expressions adds another layer of knowledge to the text, and it's crucial in applications such as machine translation and automatic summarization.
Meaning
Sign: object (apple)
Signifier: physical existence (red)
Signified: mental concept (fruit)
Only one possible interpretation
Link language to external knowledge
Support computational inference
Expressive enough
Logical semantics
To analyze a sentence is to convert it into a meaning representation
Inference
Inference support means automatically deducing new facts from premises, e.g. with first-order logic:
if A is true, then B is true;
A is true;
therefore we infer that B is true.
Semantic Parsing
In NLP, semantic parsing is the transformation of sentences to meaning representation.
Traditionally based on syntax structures
Nowadays it is treated as a sequence-to-sequence problem using deep learning
Logical formulas are hierarchical, thus sequence-to-tree algorithms are also used.
Lexical Semantics:
Linguistic study of word meaning
Terminology:
Word sense: a discrete representation of one aspect of the meaning of a word. E.g. bank can have different meanings.
Homonymy
Two senses coincidentally share an orthographic form.
e.g. financial institute, sloping mound
Polysemy
Two senses are semantically related.
e.g. solution
Homophone
Same pronunciation, but different spellings. e.g. wood/would
Homograph
Same orthographic form, but different pronunciation. e.g. base
Synonymy: two words have identical or nearly identical meaning, e.g. car/automobile
Antonyms: two words with opposite meanings
long/short, big/little
Hyponym: a sense that is more specific than another sense, e.g. dog is a hyponym of animal
Hyperonym: a sense that denotes the class of another sense
e.g. animal is a hyperonym of dog
WordNet
A database of lexical (ontological) relations
Groups words into sets of synonyms called synsets.
Synset: the set of near-synonyms for a sense
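A minimal sketch of querying these relations with NLTK's WordNet interface; it assumes NLTK is installed and the wordnet corpus has been downloaded.
```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# All synsets (senses) of the ambiguous word "bank"
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())

# Lexical relations for one sense of "dog"
dog = wn.synset("dog.n.01")
print(dog.lemma_names())   # near-synonyms in the synset
print(dog.hypernyms())     # more general senses (hyperonyms), e.g. canine.n.02
print(dog.hyponyms())      # more specific senses, e.g. puppy.n.01
```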
DBpedia
A huge lexical graph.
Uses RDF triples to encode relations between entities.
Predicate argument semantics
Considered a light semantic representation
A predicate is seen as a property that a subject has or is characterised by. Therefore it is an expression that can be true of something.
Predicates have arguments:
(arg1:sb) dance (arg2:sb)
Models in NLP:
PropBank
FrameNet
VerbNet
Learn representations for word meanings from unlabeled data
This idea is based on the theoretical principle of the distributional hypothesis:
Distributional semantics are computed from context statistics (e.g. BROWN Clusters)
Distributed semantics represent meaning by numerical vectors, rather than symbolic structures (LDA, WORD2VEC)
Brown Clustering
CRF and perceptron classifiers often perform better with discrete feature vectors
Discrete representations can be distilled from distributional vectors by clustering
BROWN clusters induce hierarchical representations from distributional vectors
Word Embedding
Word embedding
Character embedding
Sentence embedding
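A minimal sketch of learning distributed word representations with gensim's Word2Vec; the tiny corpus and the hyperparameters are illustrative assumptions, not the lecture's setup.
```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (real models need millions of tokens)
sentences = [
    ["the", "bank", "approved", "the", "loan"],
    ["she", "sat", "on", "the", "river", "bank"],
    ["the", "loan", "was", "approved", "by", "the", "bank"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["bank"][:5])           # first dimensions of the vector for "bank"
print(model.wv.most_similar("loan"))  # nearest neighbours in the embedding space
```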
Reference Resolution
Reference resolution aims to model referential ambiguity
Referring expressions: pronouns, proper nouns, and nominals
Algorithms for Coreference Resolution
Structure prediction problem with two tasks:
1) identifying which spans of text mention entities
2) clustering those mentions
mention-based models: supervised learning or ranking
entity-based models: clustering
Mention-based models as supervised learning
Mention-pair models: a binary label y(i,j) is assigned to each pair of mentions (i, j), where i < j:
if i and j corefer (zi = zj), then y(i,j) = 1
otherwise, y(i,j) = 0
Mention-based models as ranking
A classifier learns to identify a single antecedent âi = argmax over a of ψ(a, i)
for each referring expression i,
where ψ(a, i) is a score for the mention pair (a, i)
if a = ε, then mention i does not refer to any previously introduced entity.
Mention ranking is similar to the mention-pair model, but all candidates are considered simultaneously, and at most a single antecedent is selected
For each mention i, we can define an antecedent âi and an associated loss, e.g., hinge loss or cross-entropy
Entity-based models
It's more realistic, as coreference resolution is a clustering problem rather than classification or ranking
Entity-based models require a scoring function at the entity level, e.g.,
a function ψ applied to all the mentions i that are assigned to entity e, where zi is the entity referenced by mention i.
Entity embedding
Entity mentions can be embedded into a vector space, providing the
base layer for neural networks that score coreference decisions
Constructing the mention embedding:
Run a bidirectional LSTM over the entire text, obtaining hidden states from the left-to-right and right-to-left passes
Each candidate mention span (s, t) is then represented by the vertical concatenation of four vectors:
the embedding of the first word in the span
the embedding of the last word
the embedding of the head word
a vector of surface features, such as the length of the span
Using mention embedding
Given a set of mention embeddings, each mention i and candidate antecedent a is scored as ψ(a, i),
where u(a) and u(i) are the embeddings for spans a and i respectively (e.g., combined by a feedforward network).
NLP 2 - Parsing
Syntactic Analysis – determining the syntactic structure of text by analyzing the underlying grammar
Syntax = how words combine to form phrases and sentences
Gives deeper understanding of word groups and their grammatical relationships
Formally tries to resolve structural ambiguity in text
In the broad context of the NLP Pipeline:
Tokenize → POS Tag (noun, adj) → Parse (combine the words into a tree) → ...
Wide range of applications
Constituency parsing
Phrases represented as nodes in a tree; favours languages with relatively fixed word order
Adds more structure to POS tagged sentences
Splits sentences into sub-phrases or constituents
Phrase Structure Trees:
type of phrases = non-terminals
words in the sentence = terminals
Constituent: a word or a group of words that behaves as a single unit.
A constituent can be preposed or postposed as a whole while keeping the meaning of the text.
Context free grammar consists of:
A set of context-free rules, each of which expresses the ways that symbols of the language can be grouped and ordered together.
A lexicon of words and symbols: the terminals, which are the building blocks of a constituency parse.
A formal definition of CFG:
N is a set of non-terminals.
Σ is a set of terminals
R is a set of rules (productions), each of the form A -> β, rewriting a non-terminal as a string of terminals and/or non-terminals.
S is a designated start symbol as a root of the tree
The sequence of rule expansions is called a derivation of the string of words.
The sequence of terminals read off the tree is the original string.
Parsing a sentence means finding a derivation that produces the sentence while following the grammar; the derivation is represented as a tree.
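A minimal sketch of defining a toy CFG and deriving a parse tree with NLTK; the grammar and the sentence are invented for illustration.
```python
import nltk

# Toy context-free grammar: non-terminals on the left, terminals quoted
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    tree.pretty_print()   # terminals are the words, non-terminals label the phrases
```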
Probabilistic Context-free grammar (PCFG)
Attaches a probability parameter to each grammar rule
Rule parameter: P(A -> β), the probability of expanding A with that rule
Find the most likely parse tree, i.e. the tree with the largest probability over T, the set of all possible trees.
Learning PCFG from Treebanks
Penn treebank and English Web treebank
maximum likelihood estimation: P(A -> β) = Count(A -> β) / Count(A)
Grammar Equivalence
Two grammars are equivalent if they generate the same language (set of strings)
Chomsky Normal Form (CNF)
Allows only two types of rules: the right-hand side of each rule has either two non-terminals (A -> B C) or one terminal (A -> a), except for S -> ε. A, B, C ∈ N (non-terminals), and a ∈ Σ (terminals).
Top Down or bottom up parsing
Why context free grammar?
CFGs are like recursive regular expressions: they show the structure of the language and can be expressed in a tree structure with four elements:
Terminals: letters or words as building blocks of sentences;
Non-terminals: variables that defined by grammar;
Productions: set of rules;
Start Symbol: root of a tree;
It is used to parse expressions into a structure according to its grammar.
A formal language, whether a natural or a computer language, follows strict rules without considering the context.
In programming languages, CFGs are used for parsing and for defining the high-level structure of the language, which helps to understand structure and meaning.
Dependency Parsing
− Dependencies between words, faster
NLP 3 - Language Models
Language Models
Goal: assign a probability to a word sequence.
speech recognition
spelling correction
collocation error correction
machine translation
question-answering, summarisation
Probability language modeling
A language model computes the probability of a sequence of words
The probabilities of all possible sequences sum to one.
Related task: probability of an upcoming word
A language model can compute the joint probability of a sentence, or (easier) build it from conditional probabilities one word at a time via the chain rule: P(w1, ..., wn) = P(w1) P(w2 | w1) ... P(wn | w1, ..., wn-1)
Sampling from the conditional distributions ensures that the same sentence is not generated each time.
Estimating the probabilities
Counting whole sentences may give zero, as the exact sentence may never have appeared in the training corpus.
Markov assumption: assume only the most recent words are relevant,
so we only need to deal with short histories.
Unigram Model
Zero-order Markov assumption: P(wi | w1, ..., wi-1) ≈ P(wi)
Bigram Model
First-order assumption: P(wi | w1, ..., wi-1) ≈ P(wi | wi-1)
Can extend to 4-grams, 5-grams, but not much further, as the training dataset may not be large enough for higher-order n-grams.
Compute Bigram probabilities
Maximum likelihood estimation: P(wi | wi-1) = Count(wi-1, wi) / Count(wi-1)
By applying log() to probabilities, the product turns into a sum of log probabilities:
log P(I want to eat) = log P(I) + log P(want | I) + log P(to | want) + log P(eat | to)
Sums are cheaper to compute than products
and they improve numerical stability, avoiding very small numbers.
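A minimal sketch of bigram maximum likelihood estimation and log-probability scoring; the toy corpus and the sentence markers are assumptions for illustration.
```python
import math
from collections import Counter

# Toy corpus with start/end markers
corpus = [["<s>", "i", "want", "to", "eat", "</s>"],
          ["<s>", "i", "want", "chinese", "food", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def p_bigram(prev, word):
    # MLE: P(word | prev) = Count(prev, word) / Count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def log_prob(sentence):
    # log P(w1..wn) = sum of log P(wi | wi-1): sums are cheaper and more stable than products
    return sum(math.log(p_bigram(prev, w)) for prev, w in zip(sentence, sentence[1:]))

print(log_prob(["<s>", "i", "want", "to", "eat", "</s>"]))
```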
Trigram Models
Second-order assumption: P(wi | w1, ..., wi-1) ≈ P(wi | wi-2, wi-1)
Overfitting
In real life the test corpus is often different from the training corpus
Generalisation (smoothing) is used to avoid zero probabilities.
Interpolation
Mix of lower order n-gram probabilities
The weights λ are known as hyperparameters
Bigram model: a weighted sum of the bigram and unigram probabilities, so the estimate draws on both: Pinterp(wi | wi-1) = λ1 P(wi | wi-1) + λ2 P(wi)
Setting λ:
estimate the λi on held-out data
One simple estimation: choose the λi that maximise the likelihood of the held-out data (a sketch of the interpolated estimate follows below)
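A minimal self-contained sketch of the interpolated bigram estimate; the toy counts and λ = 0.7 are assumptions, in practice the λ weights are tuned on held-out data.
```python
from collections import Counter

tokens = "i want to eat i want chinese food".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_interp(prev, word, lam=0.7):
    # Weighted sum of the bigram MLE and the unigram MLE
    p_uni = unigrams[word] / len(tokens)
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

print(p_interp("want", "to"))      # seen bigram: dominated by the bigram estimate
print(p_interp("want", "food"))    # unseen bigram: falls back on the unigram estimate
```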
Absolute Discounting Interpolation
Gives better interpolation: subtract a fixed discount d from each bigram count, shifting probability mass towards the unigram estimate.
Involves the interpolation of lower- and higher-order models
Aims to deal with words that occur infrequently
PAbsDiscount(wi | wi-1) = (Count(wi-1, wi) - d) / Count(wi-1) + λ(wi-1) P(wi)
Typically d = 0.75 is used
If there are only a few words that come after a context, then a novel word in that context should be less likely
But, we also expect that if a word appears after a small number of contexts, then it should be less likely to appear in a novel context…
Kneser Ney Smoothing
Use smoothing to improve the Absolute Discounting Interpolation estimates
Provides better estimates for the probabilities of lower-order unigrams
Pcontinuation(x): how likely is word x to continue in a new context?
For each word x, count the number of bigram types it completes,
i.e. the unigram probability is proportional to the number of different words it follows
e.g. Pc(Fran) = 1/|Bigram Count|
e.g. Pc(Glasses) = 3/|Bigram Count|
Example (p22)
Kneser-Ney Smoothing
“smoothing” ~ adjusting low probs (such as zero probs) upwards, and high probs downward
i.e. adjusting MLE estimates to produce more accurate probabilities
Works very well; also used to improve neural network language models
Definition for bigrams:
PKN(wi | wi-1) = max(Count(wi-1, wi) - d, 0) / Count(wi-1) + λ(wi-1) Pcontinuation(wi)
where λ(wi-1) = (d / Count(wi-1)) · |{w : Count(wi-1, w) > 0}|; λ is set this way so that the distribution normalises (sums to one), exactly as in absolute discounting.
The only difference is Pcontinuation, which gives a higher probability estimate if wi appears after more distinct words.
Smoothing for Web-scale N-grams
“Stupid Backoff”
No discounting, just use relative frequencies
Stupid Backoff
A simplistic type of smoothing
Inexpensive to train on large data sets, and approaches the quality of Kneser-Ney smoothing as the amount of training data increases
Try to use the higher-order n-gram; otherwise drop to a shorter sequence (one word less), hence "backoff"
Every time you back off, e.g. reduce from trigram to bigram because you've encountered a zero count, you multiply by 0.4
i.e. use the trigram if good evidence is available, otherwise the bigram, otherwise the unigram
S = scores, not probabilities, so they do not sum up to one
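A minimal sketch of stupid backoff scoring; the count-dictionary format (n-gram tuples of any order mapped to counts) and the toy counts are assumptions.
```python
def stupid_backoff(word, context, ngram_counts, alpha=0.4):
    """Score a word given a tuple of preceding words, e.g. context=('i', 'want')."""
    if context:
        hist = ngram_counts.get(context, 0)
        full = ngram_counts.get(context + (word,), 0)
        if hist > 0 and full > 0:
            return full / hist                 # relative frequency, no discounting
        # back off to a shorter context and pay the 0.4 penalty
        return alpha * stupid_backoff(word, context[1:], ngram_counts, alpha)
    # unigram fallback: relative frequency of the word itself
    total = sum(c for ngram, c in ngram_counts.items() if len(ngram) == 1)
    return ngram_counts.get((word,), 0) / total if total else 0.0

counts = {("i",): 2, ("want",): 2, ("to",): 1,
          ("i", "want"): 2, ("want", "to"): 1}
print(stupid_backoff("to", ("i", "want"), counts))   # trigram unseen -> 0.4 * (1/2) = 0.2
```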
Evaluation of Language Models
Extrinsic evaluation
Put each model in a downstream task and measure how much it helps that task.
Time consuming
Intrinsic evaluation (how well the model fits held-out data)
Perplexity: the smaller the perplexity, the higher the probability of the test set and the better the model: PP(W) = P(w1, ..., wL)^(-1/L)
Weighted average number of possible next words that can follow a given word
Normalised (inverse) likelihood: the lower the better
L = length of the sequence
Works well when test and training data are similar
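A minimal sketch of perplexity computed from per-word log probabilities, following the formula above (the example probabilities are made up).
```python
import math

def perplexity(log_probs):
    """log_probs: log P(w_i | history) for each of the L words in the test sequence."""
    L = len(log_probs)
    return math.exp(-sum(log_probs) / L)     # inverse probability, normalised by length

# A model assigning probability 0.1 to each of 20 words has perplexity 10
print(perplexity([math.log(0.1)] * 20))
```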
NLP 4 - Dependency Parsing
Dependency Parsing
Aims to find dependencies between words and their type
All the words, except one, have some relationship or dependency on other words in the sentence
Root = word with no dependency, usually a verb
Dependencies = all other words, which are directly or indirectly linked to the root word
In a dependency tree:
Nodes represent words
Edges represent dependencies
Analyses the grammatical structure of sentences, by establishing relationships between “head” words and “modifier” words, which change them
Dependency parsing: task of mapping an input string to a dependency graph satisfying certain conditions
Dependency trees D = (V,E):
V is the set of nodes (words in input sequence)
E is the set of arcs indicating grammatical relations
vi->vj or (vi,vj) belongs to E denotes an arc from head vi to dependent vj.
The English language has very few non-projective cases in dependency trees (crossing arcs)
Well Formedness
A dependency graph is well formed if
Single head: each word has only one head
Acyclic: the graph should be acyclic, i.e. it contains no cycles
Connected: there is a path between any pair of nodes
Projective: an edge from word A to word B implies that there exists a directed path in the graph from A to every word between A and B in the sentence
Parsing Algorithms
Graph based parsing
Transition based parsing
Nivre's Parsing Algorithm (Arc-eager)
Transition-based
Parser configuration <S (stack), I (list of remaining input words), A (set of current dependencies)>
Input: a word sequence v = v1|...|vn, a set of rules R.
R is a set of rules (e.g. verb->noun, adj->noun)
Parser Transitions or operations
<S (stack) | I (input) | A (set of edges)>
Left-arc (LA): add the dependency edge vj -> vi (the next input word vj becomes the head of vi, the top of the stack S) and remove vi from the top of the stack
Right-arc (RA): add the dependency edge vi -> vj (the top of the stack vi becomes the head of the next input word vj) and move vj from the input onto the top of the stack
Reduce (R): remove vi from the top of the stack (vi must already have a head)
Shift (S): take the first input word and put it on top of the stack
Parsing Details
Slight modification:
conduct all of the parser operations from the initial configuration <[ROOT], (v1, ..., vn), ∅> and terminate when the parser reaches <[ROOT], nil, A>, i.e. with no remaining input and with the arc set A as the dependency graph
Nondeterministic transitions
priority ordering of transitions
guided parsing
Example of the dependency parsing operations (a toy trace is sketched below)
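A minimal sketch of the arc-eager transitions applied to a toy sentence, driven by a hand-chosen (oracle) action sequence instead of a trained classifier; the sentence and actions are assumptions for illustration.
```python
def arc_eager(words, actions):
    """Apply a sequence of arc-eager transitions and return the set of (head, dependent) arcs."""
    stack, buffer, arcs = ["ROOT"], list(words), set()
    for action in actions:
        if action == "SHIFT":              # push the next input word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":         # head = next input word, dependent = top of stack
            arcs.add((buffer[0], stack.pop()))
        elif action == "RIGHT-ARC":        # head = top of stack, dependent = next input word
            arcs.add((stack[-1], buffer[0]))
            stack.append(buffer.pop(0))
        elif action == "REDUCE":           # pop a word that already has a head
            stack.pop()
    return arcs

print(arc_eager(["she", "ate", "fish"],
                ["SHIFT", "LEFT-ARC", "RIGHT-ARC", "RIGHT-ARC"]))
# {('ate', 'she'), ('ROOT', 'ate'), ('ate', 'fish')}
```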
Properties of Nivre's Algorithm
O(n): Linear time complexity
Full dependency graphs are well formed
Dependency Corpora
CoNLL dependencies
Stanford typed dependencies
Universal dependencies
Guided Parsing
Train a classifier to predict parse transitions
Feature space
Neural networks for action prediction
parser configuration -> embedding look-up table -> hidden layers -> softmax
use neural networks to work out which parsing action should be taken to successfully parse the sentence
Dependency Structures vs. Phrase Structures
Dependency structures explicitly represent
Head-dependent relations (directed arcs)
Functional categories (arc labels)
predicate argument structure
Dependency structure independent of word order
Suitable for free word order languages, such as Indian languages
Phrase structures explicitly represent
Phrases (non-terminal nodes)
Structural categories (non-terminal labels)
Fragments are directly interpretable
Syntactic structure consists of lexical items, linked by binary asymmetric relations called dependencies
head->dependent
head (governor) : grammatically most important
dependent (modifier): modifier, object, or complement
With or without labels (types of dependencies, eg. modifier, argument)
NLP in Practice_1
Structure in a document/sentence/phrase
Model documents with different structures:
bag-of-words
sequence: we speak in sequences
tree
graph
Each structure gives a representation:
simple structures, such as bag-of-words, are efficient for search, retrieval, and text classification
complex structures such as trees are needed to model syntax, semantics, and logic, among others
Target Tasks
NLP tasks require a particular computational approach:
classification
document relevance
sequence labeling
structure prediction
ranking
Tokenisation
Divide text into words, numbers, punctuation, and other symbols
for alphabetic languages such as English, French, Korean
words are delimited by spaces
deterministic methods are sufficiently accurate
Logographic writing systems such as Mandarin, Japanese
no spaces between words
probabilistic methods are more accurate
Mandarin Word Segmentation:
a sequence tagging computational problem
LSTM-CRF architecture
Each character is tagged with one of 4 classes: beginning, middle of a multi-character word, end, or single-character word
Task: part of speech tagging
part of speech (POS) refers to the syntactic role of each token in a sentence
POS-tagging is the task of assigning POS-Tags to tokens, which are context dependent
Tag based on context
POS-tagging
Sequence tagging computational problem
LSTM architecture
Three types of embeddings (an embedding is a process used in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers):
fine-tuned embeddings (updated during training)
pre-trained embeddings (never updated)
character embeddings
average accuracy = 96.5 %
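For comparison, a minimal sketch of POS-tagging with NLTK's off-the-shelf tagger (a simpler model than the LSTM described above); the example sentence is invented and the resources are assumed to be downloadable.
```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I went for a run")
print(nltk.pos_tag(tokens))   # e.g. [('I', 'PRP'), ('went', 'VBD'), ..., ('run', 'NN')]
```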
Syntax parsing
Syntax parsing analyses the grammatical arrangement of words in a sentence and their relationships with each other.
There are two major theories to describe syntax:
Phrase structure grammar → constituency trees
Dependency grammar → dependency trees
Dependency parsing algorithms:
Transition-based parsing (previous lecture)
Graph-based parsing
Transition-based parsing:
known as shift-reduce parsers: they parse sentences from left to right, maintaining a "buffer" of not-yet-parsed words and a "stack" of words whose head has not been seen or whose dependents have not all been fully parsed.
train any multi-class machine learning classifier on features extracted from the stack, buffer, and previous arc actions in order to predict the next action.
Graph-based parsing:
takes into account all the possible trees.
It's slower, running in cubic time.
use machine learning to assign a probability to each possible edge and then construct a maximum spanning tree (MST) from these weighted edges. The tree with the highest total probability is ranked as the most accurate one.
Semantic Parsing
Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning.
Parsing method to turn sentence into a logical form.
Architecture
generative encoder-decoder
Algorithms
sequence-to-sequence
sequence-to-tree
Challenges:
How to encode syntax
How to decode a hierarchical structure
Sequence-to-tree
A two step decoder:
one for the structure
one for the words
Input -> sequence encoder -> Attention layer -> Sequence/tree decoder -> logical form.
Semantic parsing application:
Question answering
Logic representations
Chatbots
Automation
Semantic parsing data sets
Text to SQL (transfer text into SQL format)
Advising: questions about university course advising
Scholar: questions about academic publications
Restaurants: questions about restaurants
Semantic parsing is hard
Domain dependent
representation dependent
Training a general-purpose semantic parser is not possible.
Possible solution:
transfer learning (Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.)
multi-task learning
data augmentation
Discourse Parsing
Discourse parsing is the task of identifying the relatedness and the particular discourse relations among various discourse units in a text
Discourse parsing is the task of identifying the rhetorical structure of documents
or sentences (a tree or a graph);
It implies organizing text by means of relations that hold between parts of text.
Tasks:
1) Identify discourse units
2) Link discourse units with discourse relations
discourse relations is a finite set: cause, background, contrast, purpose, etc
Discourse parsing can be inter-sentential (between document sections, paragraphs,
sentences) or intra-sentential (between phrases in a sentence)
There are several discourse theories, the most popular in NLP is the Rhetorical
Structure Theory
Algorithms for discourse parsing
a) Bottom-up discourse parsing: compositional vector grammars
b) Transitional-based discourse parsing: shift-reduce algorithm as in syntax
parsing
c) Segmenting intra-sentential discourse units
c1) Train a classifier to determine whether a syntactic constituent is a discourse unit
c2) Sequence labeling model with BIO encoding, e.g., CRF algorithm
Applications of discourse parsing
Extractive summarization
Discourse classification
Text Coherence:
in text generation
machine translation
summarization
Discourse parsing is hard → requires the analysis of phrases and
sentences and other units semantically
not much annotated data for discourse
parsing, thus supervised learning is not always possible
Rule-based systems are still very popular in this area. They facilitate the automatic and semi-automatic annotation of corpora that can be used later on for training
Named Entity Classification
NER is the task of finding and classifying named entities in sentences; each name is tagged with its type based on its location and context.
a sequence labeling computational problem
LSTM-CRF architectures are the best performing ones
Training data uses BIO (Beginning, Inside, Other) encoding: a more descriptive tag schema (see the example below)
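A minimal illustration of BIO-encoded NER training data; the sentence and entity types are made up.
```python
# B- marks the beginning of an entity, I- its continuation, O any non-entity token
sentence = ["Ada", "Lovelace", "was", "born", "in", "London"]
bio_tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(list(zip(sentence, bio_tags)))
```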
Challenges: ambiguity problems
city vs person
person vs month
date vs quantity
person vs organisation
multi-language NER:
Language independent features
lack of, or too much, capitalisation
free word order languages
languages with rich morphology
NER Applications
For more complex NLP tasks, such as question answering, text summarization, or machine translation:
Faster Search Algorithms
Customer support
Content recommendation
Help other NLP tasks
Syntax parsing: by grouping named entities as single units
Relation extraction: relations usually happen between entities
Machine translation: many named entities should not be translated
Summary
Different NLP tasks require a particular computational approach.
Some NLP tasks enrich the knowledge in text to allow language formalization and understanding:
Tokenization, POS-Tagging, Syntax Parsing, NER
Other tasks are closer to the final applications:
Sentiment analysis, summarization
NLP in Practice_2
Task: Relation extraction:
Relationship extraction is the task of extracting semantic relationships from a text. Extracted relationships usually occur between two or more entities of a certain type (e.g. Person, Organisation, Location) and fall into a number of semantic categories (e.g. married to, employed by, lives in).
How relations are expressed in natural language
Relations are instantiated by predicates
Predicates have arguments
Verbs are the most productive predicate form
Why relation extraction
Create new structured knowledge
Augment current knowledge bases
Adding words to the WordNet thesaurus
find semantic relationships between pairs of mentions of entities.
RE is essential for many downstream tasks such as knowledge base completion and question answering.
support question answering
Which relations to extract and how
A pre-defined set of relations
Open relation extraction
Extracting a pre-defined set of relations
Sequence labeling problem
Classification algorithm
Supervised relation extraction between entities
Find all pairs of named entities (person, location, organisation, etc.)
Decide if 2 entities are related
If yes, classify the relation into relation types
Can use different classifier:
Naive Bayes, CRF, SVM, CNN, etc.
Open Relation Extraction: all relations
Extract relations and their arguments from syntax trees
Lots of unsupervised approaches
New trend: treat it as a sequence labeling problem
Unsupervised relation extraction:
Preprocessing -> entity features/sentence features -> PCA -> HAC clustering
Task:Sentiment analysis
Also known as opinion polarity detection; it is a text classification problem.
Who uses it:
Marketers
Social scientists
Digital humanities scientists
Health experts
Text is classified into three classes:
positive
negative
neutral
Long documents: a bag-of-words model works well.
Short documents: bag of bigrams
Treating negation is important
But a sentence can also contain several opposite sentiments.
GloVe embeddings model from AllenNLP
RoBERTa large model from AllenNLP
These models can give opposite sentiment results for the same text
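A minimal sketch of sentiment analysis as bag-of-words (plus bigram) text classification with scikit-learn; the tiny training set and its labels are invented for illustration.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great film, loved it", "terrible plot and bad acting",
               "wonderful performance", "boring and not worth watching"]
train_labels = ["positive", "negative", "positive", "negative"]

# Unigrams + bigrams, so short negated phrases like "not worth" become features
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)
print(model.predict(["not a bad film at all"]))   # negation makes this hard for bag-of-words
```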
Task: Word sense disambiguation
Word sense disambiguation is the task of identifying the intended sense of each word token in a sentence
It requires first identifying the correct part-of-speech and lemma for each token to begin with
Multi-class classification:
classify a given lemma into a pre-defined set of possible senses
usually senses are taken from WordNet
modeling the context of the lemma is crucial
Unsupervised approaches:
e.g., Transforming word embeddings into sense embeddings using graph clustering
Not trivial, as the same word in different contexts can be predicted incorrectly
Hints to help with WSD
One sense per discourse hypothesis: knowing the genre of the document/sentence helps (the context is useful)
e.g., is a given sentence from a document about ecology or business?
Document topic classification is easier than WSD, so it is helpful to run topic classification first.
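A minimal sketch of a classical unsupervised WSD baseline, the Lesk algorithm as implemented in NLTK (not the supervised classifier described above); it picks the WordNet sense whose gloss overlaps most with the context and may not always choose the intuitive sense.
```python
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

context = nltk.word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, "bank", pos="n")     # choose a noun sense of "bank" given the context
print(sense, "-", sense.definition())
```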
Task: Summarisation
Automatic summarization is a classical NLP problem with more than 60 years
of history and still a HOT topic
Many possible computational approaches: sequence labeling, classification, ranking, rule-based e.g., syntax tree pruning
Extractive summarization
Information Extraction approach:
copy the most important information to the summary (e.g.: key phrases, clauses, sentences, paragraphs, etc.)
Algorithms
✓ Heuristic-based
✓ Classification problem
✓ Ranking
Steps
1) Sentence ranking
2) Sentence selection
3) Sentence reformulation (in novel methods)
4) Sentence ordering
Step 1: Relevance methods to assess which sentences are the most important
Input: sentences
Output: sentences are ranked according to their relevance
Common relevance methods:
keywords
position
titles
indicative phrases
Step 2: Sentence selection as classification
Each sentence is described by a set of features (as in the previous step)
Two classes: extract | do-not-extract (binary classification process)
Algorithms:
Regression models for importance prediction
Learning to rank models that assign high ranks to important sentences
Sequence labeling models: model inter-sentence dependency
The problem of redundancy
Multi-document extractive summarization faces a problem of potential redundancy
→ extract sentences that are both "central" (i.e., contain the main ideas)
→ and "diverse" (i.e., they differ from one another)
The Maximal Marginal Relevance (MMR) algorithm is used to model redundancy
sim_1 = relevance to the query
sim_2 = novelty of the information (dissimilarity to already selected sentences)
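A minimal sketch of greedy MMR selection; the interface (a relevance dict for sim_1 and a pairwise similarity function for sim_2) and the example scores are assumptions.
```python
def mmr_select(candidates, relevance, similarity, k=3, lam=0.7):
    """Greedily pick k candidates balancing query relevance against redundancy."""
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((similarity(i, j) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

rel = {0: 0.9, 1: 0.8, 2: 0.3}
sim = lambda i, j: 0.95 if {i, j} == {0, 1} else 0.1   # sentences 0 and 1 are near-duplicates
print(mmr_select([0, 1, 2], rel, sim, k=2, lam=0.5))   # picks 0, then 2 instead of redundant 1
```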
Unsupervised extractive summarisation
category description
documents in category
algorithm for unsupervised keyphrase extraction
keyphrases with score
re-ranked results using MMR
Abstractive summarization
Text generation approach: generate
entirely new phrases and sentences to capture the meaning of the source document
Involves re-writing sentences
Sentence paraphrasing or
Sentence simplification or
Sentence compression or
or/and generation of novel content
From graphs or tables to sentences
1) Content determination (what information?)
2) Text/Doc structuring (ordering)
3) Sentence aggregation (merging sents. = readability, naturalness)
4) Lexicalization (from concepts to words)
5) Referring expressions generation (pronouns, anaphora)
6) Realization (according to syntax and morphology)
From sentences to sentences:
syntax-based heuristics
sequence-to-sequence models
Summary
Some NLP tasks enrich the text to allow language formalization and understanding
POS-Tagging, syntax parsing, NER
Other tasks are related to final applications
Sentiment analysis, summarization
NLP in Practice_4 NLP Evaluation
Intrinsic evaluation
Directly test a task's correctness using a gold standard:
Evaluate a POS-Tagger
Calculate the match between predicted and gold-standard POS-tags
Extrinsic evaluation
Test whether the output is useful for downstream tasks
Evaluate a summarization technique in an IR problem
Does having summaries (instead of complete docs) help a particular retrieval task?
A gold standard is a set of correct answers/annotations/tags, etc., which is supposed to be representative of a problem.
A baseline is something to compare to; sometimes it's something we want to beat, e.g., a similar method or a simpler approach
Classifiers Evaluation
Evaluate using a test set (or gold standard)
Do not evaluate/test in your training set
Cross-validation is an option when there is no test set
Classification metrics
Accuracy: the number of correct predictions, divided by the total number of instances
Not suitable for class-imbalanced datasets (most data sets are imbalanced)
Precision, recall, and F-measure (same as in IR)
There are two possible errors:
False positive: the system incorrectly predicts the label
False negative: the system incorrectly fails to predict the label
There are two ways to be correct:
True positive: the system correctly predicts the label
True negative: the system correctly predicts that the label does not apply to this
instance
Evaluating multi-class classification
Macro F-measure: when there are multiple labels of interest (e.g., in word sense disambiguation or emotion classification), it is necessary to combine the F-measure across the classes
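A minimal sketch of accuracy and macro-averaged precision, recall and F-measure with scikit-learn; the gold labels and predictions are made up.
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

gold = ["pos", "pos", "neg", "neg", "neu", "pos"]
pred = ["pos", "neg", "neg", "neg", "neu", "pos"]

print("accuracy :", accuracy_score(gold, pred))
print("precision:", precision_score(gold, pred, average="macro"))   # averaged over classes
print("recall   :", recall_score(gold, pred, average="macro"))
print("macro F  :", f1_score(gold, pred, average="macro"))
```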
Threshold free metric: ROC-AUC
AUC: Area Under The Curve
ROC: Receiver Operating Characteristics
For binary classification, it shows the trade-off between the true positive rate and the false positive rate across decision thresholds
AUROC of 0.5 (area under the red dashed line) corresponds to a coin flip, i.e. a useless model
AUROC less than 0.7 is sub-optimal performance
AUROC of 0.70 – 0.80 is good performance
AUROC greater than 0.8 is excellent performance
AUROC of 1.0 (area under the purple line) corresponds to a perfect
classifier
BLEU (bilingual evaluation understudy): a modified form of precision to compare a
candidate translation against multiple reference translations
Perfect match = 1.0
Perfect mismatch = 0.0
pn = (number of n-grams appearing in both the reference and the hypothesis translation) / (number of n-grams in the hypothesis translation)
Brevity penalty (BP): precision-based metrics are biased in favor of short translations, so BP penalises hypotheses that are shorter than the reference
(Figure: a reference translation and three system outputs; for each output, pn indicates the precision at each n-gram, and BP indicates the brevity penalty)
Term overlap evaluation metrics
Machine translation
Text summarization
Text simplification
Q&A
Chatbots
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): calculates the recall between human and automatic outputs in terms of n-grams (n-gram overlap)
ROUGE-N: overlap of n-grams between the system output and a gold-standard reference
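A minimal sketch of ROUGE-N as n-gram recall between a system output and a single reference; the sentences are invented and real ROUGE implementations add stemming and multi-reference handling.
```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(system, reference, n=1):
    # recall: fraction of the reference's n-grams that also occur in the system output
    sys_counts, ref_counts = ngrams(system, n), ngrams(reference, n)
    overlap = sum(min(count, sys_counts[gram]) for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "the cat sat on the mat".split()
system = "the cat was on the mat".split()
print(rouge_n(system, reference, n=1), rouge_n(system, reference, n=2))   # 5/6 and 3/5
```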
Evaluation examples:
Semantic parsing
Dependency parsing
named entity recognition
Ranking
Term overlap metrics:
machine translation
text generation
text simplification
Evaluation Methodology
Always split in 3:
training set
validation set
test set
never test with training set
with a small dataset, use the cross-validation technique:
Divide original data into k parts
Classifier comparison
Comparison is between algorithms
e.g., logistic regression vs. Perceptron,
L2 regularization vs. L1
Comparison is between feature sets
e.g., bag-of-words vs. word embeddings
word embeddings vs. character embeddings
Ablation testing involves systematically removing (ablating) various aspects of a classifier, such as feature groups, and testing the null hypothesis that the ablated classifier is as good as the full model
Evaluation can be tricky, e.g., in automatic summarization
Intrinsic evaluation:
Humans read the documents and decide which are the most relevant sentences
ROUGE measure: calculate the recall between human and automatic summaries in terms of n-grams (n-gram overlap)
Extrinsic evaluation:
Verify that the summaries are useful for a specific task, e.g. text classification
Issues regarding summarisation evaluation:
humans usually do not agree on which are the most important sentences;
there is more than one possible summary; human-generated summaries are costly
Summary
Meaningful evaluations are essential for measuring success
NLP methods are mostly intrinsically evaluated against a gold standard
... gold standards are always biased
Comparison of methods is not trivial
Statistical significance testing is a fundamental tool in method comparison
:star:Why Nivre's Parsing?
Syntactic structure of a sentence is described solely in terms of the words (or lemmas) in a sentence.
and an associated set of directed binary grammatical relations that hold among the words
Involves:
LEFTARC
RIGHTARC
SHIFT
REDUCE: Remove from the stack
:star: Why ADI and KNS?
:star: Why semantic?
Understanding of the logic and meaning of the sentence or text.
NLP in practice 4: Multi-lingual NLP
Multi-lingual word embeddings
The alignment between monolingual embeddings; there are several methods:
Supervised
using a training bilingual dictionary, learn a mapping from the source to the target space using (iterative) Procrustes alignment
Unsupervised
without any parallel data, learn a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement
MUSE(Multilingual Unsupervised and Supervised Embeddings)
A Python library that uses fastText embeddings, with large-scale bilingual dictionaries for training and evaluation.
Supervised word embeddings for 30 languages, aligned in a single vector space from fastText
Ground truth bilingual dictionaries from fastText
Similar geometric relations can be used to put languages in correspondence (e.g. English and Spanish)
These geometric relations do not need dictionaries and can be exploited by the unsupervised model, which finds the spatial similarity between the two languages' embedding spaces.
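A minimal sketch of the orthogonal Procrustes step used in such alignments: given paired source/target vectors X and Y from a seed dictionary, the rotation W minimising ||WX - Y|| is U Vᵀ from the SVD of Y Xᵀ. The toy data (the target space is just a random rotation of the source space) is an assumption for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))                 # source-language vectors (dim x dictionary size)
true_rotation, _ = np.linalg.qr(rng.normal(size=(50, 50)))
Y = true_rotation @ X                          # toy target vectors: a rotated copy of X

U, _, Vt = np.linalg.svd(Y @ X.T)
W = U @ Vt                                     # optimal orthogonal map from source to target
print(np.allclose(W @ X, Y))                   # True: the rotation is recovered
```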
Summary
Multilingual NLP deals with cross-lingual resources and models
There is a need for the development of multilingual applications, for example commercial applications such as product recommendation. It's possible to build multilingual NLP resources with monolingual or parallel corpora → lots of data
Universal linguistic representations are key to success in multilingual NLP
Low resource NLP
Languages lacking large monolingual or parallel corpora and/or manually crafted resources
NLP models require large amounts of training data and complex language-specific engineering (such engineering is expensive and requires linguistically trained speakers of the language)
Reasons:
Language diversity and preserving languages.
Extending NLP to areas that are less listened to.
In emergency situations, bilingual models can help people who speak different languages.
There are a large number (~6k) of languages that lack NLP resources.
Active learning
Transfer learning
Multi-task learning
Learning to learn and Meta-learning
Semi-supervised learning
Dual learning
Unsupervised learning
Unsupervised learning:
Unsupervised POS-tagging;
Unsupervised dependency parsing
Brown clustering
Other ideas:
language projection
universal representations and interlinguas
Cross Lingual Transfer Learning:
Transfer of annotations
Such as POS tags, syntactic or semantic features via cross lingual bridges (e.g., word or phrase alignments)
Transfer of models (similar to pre-trained model):
Training a model in a resource rich language and applying it in a resource poor language in zero shot or one shot learning
Transfer other parameters: features
Joint Multilingual or “Polyglot” Learning
Resource rich and resource poor learning using a language universal representation
Convert data in all languages to a shared representation (e.g., multilingual word vectors)
Train a single model on a mix of datasets in all languages, to enable parameter sharing where possible
Summary
We have reviewed some of the fundamental NLP tasks that allow access to meaning and text understanding, so that it's possible to build NLP applications
Machine learning, especially deep learning, is the favorite current approach to NLP, but not the only one or the most suitable one. It depends on the task and data availability
Meaningful evaluations are essential for measuring success
There are more than 7000 languages in the world today
A few dominate → lots of resources available for NLP
Small languages → low resource NLP
Multi-lingual NLP problems
Intra-word code switching (even harder!)
As a sequence labeling problem at the character level
Tags with BIO encoding
As a text classification problem
Multi-lingual resources
Universal Dependencies: POS Tags, morphological features, and syntactic dependencies across 70 languages
:star:What are some of the NLP tasks?
:star:why pos-tagging
A POS tag is a tag that indicates the part of speech for a word.
POS tags have been used for a variety of NLP tasks and are extremely useful since they provide linguistic signal on how a word is being used within the scope of a phrase, sentence, or document.
What I mean by this is that the word “run” can be used as a verb “I run 5 miles every day” or as a noun “I went for a run”.
Sometimes the POS is very very useful in cases where it distinguishes the word sense (the meaning of the word).
In other cases, it is still useful in explaining the syntactic role of a word, and we can often infer semantic information from this due to our knowledge of how this syntactic role is commonly used semantically.
Edges represent dependencies
Analyses the grammatical structure of sentences, by establishing relationships between “head” words and “modifier” words, which change them