03. Natural Language Processing
NLP 1 - Semantics
Meaning of language
Linguistics and logic concerned with meaning
Translation, fact-checking, logic checking
From the linguist's point of view, meaning can be modeled in several ways:
using logic
using relations between words (synonyms, hyperonyms, etc.)
In NLP we use formalisms (first-order logic, predicate structures, etc.) and resources (WordNet, DBpedia, etc.) to enrich text with meaning and knowledge about the world.
Modeling reference expressions adds another layer of knowledge to the text, and it's crucial in applications such as machine translation and automatic summarization.
Meaning
Sign: object (apple)
Signifier: physical existence (red)
Signified: mental concept (fruit)
Only one possible interpretation
Link language to external knowledge
Support computational inference
Expressive enough
Logical semantics
To analyze a sentence is to convert it into a meaning representation
Inference
Inference support means automatically deducing new facts from premises, e.g. with first-order logic:
if A is true, then B is true;
A is true;
therefore we infer that B is true.
Semantic Parsing
In NLP, semantic parsing is the transformation of sentences to meaning representation.
Traditionally based on syntax structures
Nowadays it is treated as a sequence-to-sequence problem using deep learning
Logical formulas are hierarchical, thus sequence-to-tree algorithms are also used.
Lexical Semantics:
Linguistic study of word meaning
Terminology:
Word sense: a discrete representation of one aspect of the meaning of a word. E.g. bank can have different meanings.
Homonymy
Two senses coincidentally share an orthographic form.
e.g. financial institute, sloping mound
Polysemy
Two senses are semantically related.
e.g. solution
Homophone
Same pronunciation, but different spellings. e.g. wood/would
Homograph
Same orthographic form, but different pronunciation. e.g. base
Synonymy: two words have identical or nearly identical meaning, e.g. car/automobile
Antonyms: two words with opposite meanings
long/short, big/little
Hyponym: a sense that is more specific than another sense, e.g. dog is a hyponym of animal
Hyperonym: a sense that denotes the class of another sense
e.g. animal is a hyperonym of dog
WordNet
A database of lexical (ontological) relations
Groups words into sets of synonyms called synsets.
Synset: the set of near-synonyms for a sense
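A minimal sketch of querying these relations with NLTK's WordNet interface; it assumes NLTK is installed and the wordnet corpus has been downloaded.
```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# All synsets (senses) of the ambiguous word "bank"
for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())

# Lexical relations for one sense of "dog"
dog = wn.synset("dog.n.01")
print(dog.lemma_names())   # near-synonyms in the synset
print(dog.hypernyms())     # more general senses (hyperonyms), e.g. canine.n.02
print(dog.hyponyms())      # more specific senses, e.g. puppy.n.01
```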
DBpedia
A huge lexical graph.
Uses RDF triples to encode relations between entities.
Predicate argument semantics
Considered a light semantic representation
A predicate is seen as a property that a subject has or is characterised by. Therefore it is an expression that can be true of something.
Predicates have arguments:
(arg1:sb) dance (arg2:sb)
Models in NLP:
PropBank
FrameNet
VerbNet
Learn representations for word meanings from unlabeled data
This idea is based on the theoretical principle of the distributional hypothesis:
Distributional semantics are computed from context statistics (e.g. BROWN Clusters)
Distributed semantics represent meaning by numerical vectors, rather than symbolic structures (LDA, WORD2VEC)
Brown Clustering
CRF and perceptron classifiers often perform better with discrete feature vectors
Discrete representations can be distilled from distributional vectors by clustering
BROWN clusters induce hierarchical representations from distributional vectors
Word Embedding
Word embedding
Character embedding
Sentence embedding
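A minimal sketch of learning distributed word representations with gensim's Word2Vec; the tiny corpus and the hyperparameters are illustrative assumptions, not the lecture's setup.
```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (real models need millions of tokens)
sentences = [
    ["the", "bank", "approved", "the", "loan"],
    ["she", "sat", "on", "the", "river", "bank"],
    ["the", "loan", "was", "approved", "by", "the", "bank"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["bank"][:5])           # first dimensions of the vector for "bank"
print(model.wv.most_similar("loan"))  # nearest neighbours in the embedding space
```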
Reference Resolution
Reference resolution aims to model referential ambiguity
Referring expressions: pronouns, proper nouns, and nominals
Algorithms for Coreference Resolution
Structure prediction problem with two tasks:
1) identifying which spans of text mention entities
2) clustering those mentions
mention-based models: supervised learning or ranking
entity-based models: clustering
Mention-based models as supervised learning
Mention-pair models: a binary label y(i,j) is assigned to each pair of mentions (i, j), where i < j:
if i and j corefer (zi = zj), then y(i,j) = 1
otherwise, y(i,j) = 0
Mention-based models as ranking
A classifier learns to identify a single antecedent âi = argmax over a of ψ(a, i)
for each referring expression i,
where ψ(a, i) is a score for the mention pair (a, i)
if a = ε, then mention i does not refer to any previously introduced entity.
Mention ranking is similar to the mention-pair model, but all candidates are considered simultaneously, and at most a single antecedent is selected
For each mention i, we can define an antecedent âi and an associated loss, e.g., hinge loss or cross-entropy
Entity-based models
It's more realistic, as coreference resolution is a clustering problem rather than classification or ranking
Entity-based models require a scoring function at the entity level, e.g.,
a function ψ applied to all the mentions i that are assigned to entity e, where zi is the entity referenced by mention i.
Entity embedding
Entity mentions can be embedded into a vector space, providing the
base layer for neural networks that score coreference decisions
Constructing the mention embedding:
Run a bidirectional LSTM over the entire text, obtaining hidden states from the left-to-right and right-to-left passes
Each candidate mention span (s, t) is then represented by the vertical concatenation of four vectors:
the embedding of the first word in the span
the embedding of the last word
the embedding of the head word
a vector of surface features, such as the length of the span
Using mention embedding
Given a set of mention embeddings, each mention i and candidate antecedent a is scored as ψ(a, i),
where u(a) and u(i) are the embeddings for spans a and i respectively (e.g., combined by a feedforward network).
NLP 2 - Parsing
Syntactic Analysis – determining the syntactic structure of text by analyzing the underlying grammar
Syntax = how words combine to form phrases and sentences
Gives deeper understanding of word groups and their grammatical relationships
Formally tries to resolve structural ambiguity in text
In the broad context of the NLP Pipeline:
Tokenize → POS Tag (noun, adj) → Parse (combine the words into a tree) → ...
Wide range of applications
Constituency parsing
Phrases represented as nodes in a tree; favours languages with relatively fixed word order
Adds more structure to POS tagged sentences
Splits sentences into sub-phrases or constituents
Phrase Structure Trees:
type of phrases = non-terminals
words in the sentence = terminals
Constituent: a word or a group of words that behaves as a single unit.
A constituent can be preposed or postposed as a whole while keeping the meaning of the text.
Context free grammar consists of:
A set of context-free rules, each of which expresses the ways that symbols of the language can be grouped and ordered together.
A lexicon of words and symbols: the terminals, which are the building blocks of a constituency parse.
A formal definition of CFG:
N is a set of non-terminals.
Σ is a set of terminals
R is a set of rules (productions), each of the form A -> β, rewriting a non-terminal as a string of terminals and/or non-terminals.
S is a designated start symbol as a root of the tree
The sequence of rule expansions is called a derivation of the string of words.
The sequence of terminals read off the tree is the original string.
Parsing a sentence means finding a derivation that produces the sentence while following the grammar; the derivation is represented as a tree.
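A minimal sketch of defining a toy CFG and deriving a parse tree with NLTK; the grammar and the sentence are invented for illustration.
```python
import nltk

# Toy context-free grammar: non-terminals on the left, terminals quoted
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    tree.pretty_print()   # terminals are the words, non-terminals label the phrases
```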
Probabilistic Context-free grammar (PCFG)
Attaches a probability parameter to each grammar rule
Rule parameter: P(A -> β), the probability of expanding A with that rule
Find the most likely parse tree, i.e. the tree with the largest probability over T, the set of all possible trees.
Learning PCFG from Treebanks
Penn treebank and English Web treebank
maximum likelihood estimation: P(A -> β) = Count(A -> β) / Count(A)
Grammar Equivalence
Two grammars are equivalent if they generate the same language (set of strings)
Chomsky Normal Form (CNF)
Allows only two types of rules: the right-hand side of each rule has either two non-terminals (A -> B C) or one terminal (A -> a), except for S -> ε. A, B, C ∈ N (non-terminals), and a ∈ Σ (terminals).
Top Down or bottom up parsing
Why context free grammar?
CFGs are like recursive regular expressions: they show the structure of the language and can be expressed in a tree structure with four elements:
Terminals: letters or words as building blocks of sentences;
Non-terminals: variables that defined by grammar;
Productions: set of rules;
Start Symbol: root of a tree;
It is used to parse expressions into a structure according to its grammar.
A formal language, whether a natural or a computer language, follows strict rules without considering the context.
In programming languages, CFGs are used for parsing and for defining the high-level structure of the language, which helps to understand structure and meaning.
Dependency Parsing
− Dependencies between words, faster
NLP 3 - Language Models
Language Models
Goal: assign a probability to a word sequence.
speech recognition
spelling correction
collocation error correction
machine translation
question-answering, summarisation
Probability language modeling
A language model computes the probability of a sequence of words
The probabilities of all possible sequences sum to one.
Related task: probability of an upcoming word
A language model can compute the joint probability of a sentence, or (easier) build it from conditional probabilities one word at a time via the chain rule: P(w1, ..., wn) = P(w1) P(w2 | w1) ... P(wn | w1, ..., wn-1)
Sampling from the conditional distributions ensures that the same sentence is not generated each time.
Estimating the probabilities
Counting whole sentences may give zero, as the exact sentence may never have appeared in the training corpus.
Markov assumption: assume only the most recent words are relevant,
so we only need to deal with short histories.
Unigram Model
Zero-order Markov assumption: P(wi | w1, ..., wi-1) ≈ P(wi)
Bigram Model
First-order assumption: P(wi | w1, ..., wi-1) ≈ P(wi | wi-1)
Can extend to 4-grams, 5-grams, but not much further, as the training dataset may not be large enough for higher-order n-grams.
Compute Bigram probabilities
Maximum likelihood estimation: P(wi | wi-1) = Count(wi-1, wi) / Count(wi-1)
By applying log() to probabilities, the product turns into a sum of log probabilities:
log P(I want to eat) = log P(I) + log P(want | I) + log P(to | want) + log P(eat | to)
Sums are cheaper to compute than products
and they improve numerical stability, avoiding very small numbers.
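A minimal sketch of bigram maximum likelihood estimation and log-probability scoring; the toy corpus and the sentence markers are assumptions for illustration.
```python
import math
from collections import Counter

# Toy corpus with start/end markers
corpus = [["<s>", "i", "want", "to", "eat", "</s>"],
          ["<s>", "i", "want", "chinese", "food", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def p_bigram(prev, word):
    # MLE: P(word | prev) = Count(prev, word) / Count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def log_prob(sentence):
    # log P(w1..wn) = sum of log P(wi | wi-1): sums are cheaper and more stable than products
    return sum(math.log(p_bigram(prev, w)) for prev, w in zip(sentence, sentence[1:]))

print(log_prob(["<s>", "i", "want", "to", "eat", "</s>"]))
```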
Trigram Models
Second-order assumption: P(wi | w1, ..., wi-1) ≈ P(wi | wi-2, wi-1)
Overfitting
In real life the test corpus is often different from the training corpus
Generalisation (smoothing) is used to avoid zero probabilities.
Interpolation
Mix of lower order n-gram probabilities
The weights λ are known as hyperparameters
Bigram model: a weighted sum of the bigram and unigram probabilities, so the estimate draws on both: Pinterp(wi | wi-1) = λ1 P(wi | wi-1) + λ2 P(wi)
Setting λ:
estimate the λi on held-out data
One simple estimation: choose the λi that maximise the likelihood of the held-out data (a sketch of the interpolated estimate follows below)
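A minimal self-contained sketch of the interpolated bigram estimate; the toy counts and λ = 0.7 are assumptions, in practice the λ weights are tuned on held-out data.
```python
from collections import Counter

tokens = "i want to eat i want chinese food".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_interp(prev, word, lam=0.7):
    # Weighted sum of the bigram MLE and the unigram MLE
    p_uni = unigrams[word] / len(tokens)
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

print(p_interp("want", "to"))      # seen bigram: dominated by the bigram estimate
print(p_interp("want", "food"))    # unseen bigram: falls back on the unigram estimate
```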
Absolute Discounting Interpolation
Gives better interpolation: subtract a fixed discount d from each bigram count, shifting probability mass towards the unigram estimate.
Involves the interpolation of lower- and higher-order models
Aims to deal with words that occur infrequently
PAbsDiscount(wi | wi-1) = (Count(wi-1, wi) - d) / Count(wi-1) + λ(wi-1) P(wi)
Typically d = 0.75 is used
If there are only a few words that come after a context, then a novel word in that context should be less likely
But, we also expect that if a word appears after a small number of contexts, then it should be less likely to appear in a novel context…
Kneser Ney Smoothing
Use smoothing to improve the Absolute Discounting Interpolation estimates
Provides better estimates for the probabilities of lower-order unigrams
Pcontinuation(x): how likely is word x to continue in a new context?
For each word x, count the number of bigram types it completes,
i.e. the unigram probability is proportional to the number of different words it follows
e.g. Pc(Fran) = 1/|Bigram Count|
e.g. Pc(Glasses) = 3/|Bigram Count|
Example (p22)
Kneser-Ney Smoothing
“smoothing” ~ adjusting low probs (such as zero probs) upwards, and high probs downward
i.e. adjusting MLE estimates to produce more accurate probabilities
Works very well; also used to improve neural network language models
Definition for bigrams:
PKN(wi | wi-1) = max(Count(wi-1, wi) - d, 0) / Count(wi-1) + λ(wi-1) Pcontinuation(wi)
where λ(wi-1) = (d / Count(wi-1)) · |{w : Count(wi-1, w) > 0}|; λ is set this way so that the distribution normalises (sums to one), exactly as in absolute discounting.
The only difference is Pcontinuation, which gives a higher probability estimate if wi appears after more distinct words.
Smoothing for Web-scale N-grams
“Stupid Backoff”
No discounting, just use relative frequencies
Stupid Backoff
A simplistic type of smoothing
Inexpensive to train on large data sets, and approaches the quality of Kneser-Ney smoothing as the amount of training data increases
Try to use the higher-order n-gram; otherwise drop to a shorter sequence (one word less), hence "backoff"
Every time you back off, e.g. reduce from trigram to bigram because you've encountered a zero count, you multiply by 0.4
i.e. use the trigram if good evidence is available, otherwise the bigram, otherwise the unigram
S = scores, not probabilities, so they do not sum up to one
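A minimal sketch of stupid backoff scoring; the count-dictionary format (n-gram tuples of any order mapped to counts) and the toy counts are assumptions.
```python
def stupid_backoff(word, context, ngram_counts, alpha=0.4):
    """Score a word given a tuple of preceding words, e.g. context=('i', 'want')."""
    if context:
        hist = ngram_counts.get(context, 0)
        full = ngram_counts.get(context + (word,), 0)
        if hist > 0 and full > 0:
            return full / hist                 # relative frequency, no discounting
        # back off to a shorter context and pay the 0.4 penalty
        return alpha * stupid_backoff(word, context[1:], ngram_counts, alpha)
    # unigram fallback: relative frequency of the word itself
    total = sum(c for ngram, c in ngram_counts.items() if len(ngram) == 1)
    return ngram_counts.get((word,), 0) / total if total else 0.0

counts = {("i",): 2, ("want",): 2, ("to",): 1,
          ("i", "want"): 2, ("want", "to"): 1}
print(stupid_backoff("to", ("i", "want"), counts))   # trigram unseen -> 0.4 * (1/2) = 0.2
```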
Evaluation of Language Models
Extrinsic evaluation
Put each model in a downstream task and measure how much it helps that task.
Time consuming
Intrinsic evaluation (how well the model fits held-out data)
Perplexity: the smaller the perplexity, the higher the probability of the test set and the better the model: PP(W) = P(w1, ..., wL)^(-1/L)
Weighted average number of possible next words that can follow a given word
Normalised (inverse) likelihood: the lower the better
L = length of the sequence
Works well when test and training data are similar
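A minimal sketch of perplexity computed from per-word log probabilities, following the formula above (the example probabilities are made up).
```python
import math

def perplexity(log_probs):
    """log_probs: log P(w_i | history) for each of the L words in the test sequence."""
    L = len(log_probs)
    return math.exp(-sum(log_probs) / L)     # inverse probability, normalised by length

# A model assigning probability 0.1 to each of 20 words has perplexity 10
print(perplexity([math.log(0.1)] * 20))
```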
NLP 4 - Dependency Parsing
Dependency Parsing
Aims to find dependencies between words and their type
All the words, except one, have some relationship or dependency on other words in the sentence
Root = word with no dependency, usually a verb
Dependencies = all other words, which are directly or indirectly linked to the root word
In a dependency tree:
Nodes represent words
Edges represent dependencies
Analyses the grammatical structure of sentences, by establishing relationships between “head” words and “modifier” words, which change them
Dependency parsing: task of mapping an input string to a dependency graph satisfying certain conditions
Dependency trees D = (V,E):
V is the set of nodes (words in input sequence)
E is the set of arcs indicating grammatical relations
vi->vj or (vi,vj) belongs to E denotes an arc from head vi to dependent vj.
The English language has very few non-projective cases in dependency trees (crossing arcs)
Well Formedness
A dependency graph is well formed if
Single head: each word has only one head
Acyclic: the graph should be acyclic, i.e. it contains no cycles
Connected: there is a path between any pair of nodes
Projective: an edge from word A to word B implies that there exists a directed path in the graph from A to every word between A and B in the sentence
Parsing Algorithms
Graph based parsing
Transition based parsing
Nivre's Parsing Algorithm (Arc-eager)
Transition-based
Parser configuration <S (stack), I (list of remaining input words), A (set of current dependencies)>
Input: a word sequence v = v1|...|vn, a set of rules R.
R is a set of rules (e.g. verb->noun, adj->noun)
Parser Transitions or operations
<S (stack) | I (input) | A (set of edges)>
Left-arc (LA): add the dependency edge vj -> vi (the next input word vj becomes the head of vi, the top of the stack S) and remove vi from the top of the stack
Right-arc (RA): add the dependency edge vi -> vj (the top of the stack vi becomes the head of the next input word vj) and move vj from the input onto the top of the stack
Reduce (R): remove vi from the top of the stack (vi must already have a head)
Shift (S): take the first input word and put it on top of the stack
Parsing Details
Slight modification:
conduct all of the parser operations from the initial configuration <[ROOT], (v1, ..., vn), ∅> and terminate when the parser reaches <[ROOT], nil, A>, i.e. with no remaining input and with the arc set A as the dependency graph
Nondeterministic transitions
priority ordering of transitions
guided parsing
Example of the dependency parsing operations (a toy trace is sketched below)
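A minimal sketch of the arc-eager transitions applied to a toy sentence, driven by a hand-chosen (oracle) action sequence instead of a trained classifier; the sentence and actions are assumptions for illustration.
```python
def arc_eager(words, actions):
    """Apply a sequence of arc-eager transitions and return the set of (head, dependent) arcs."""
    stack, buffer, arcs = ["ROOT"], list(words), set()
    for action in actions:
        if action == "SHIFT":              # push the next input word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":         # head = next input word, dependent = top of stack
            arcs.add((buffer[0], stack.pop()))
        elif action == "RIGHT-ARC":        # head = top of stack, dependent = next input word
            arcs.add((stack[-1], buffer[0]))
            stack.append(buffer.pop(0))
        elif action == "REDUCE":           # pop a word that already has a head
            stack.pop()
    return arcs

print(arc_eager(["she", "ate", "fish"],
                ["SHIFT", "LEFT-ARC", "RIGHT-ARC", "RIGHT-ARC"]))
# {('ate', 'she'), ('ROOT', 'ate'), ('ate', 'fish')}
```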
Properties of Nivre's Algorithm
O(n): Linear time complexity
Full dependency graphs are well formed
Dependency Corpora
CoNLL dependencies
Stanford typed dependencies
Universal dependencies
Guided Parsing
Train a classifier to predict parse transitions
Feature space
Neural networks for action prediction
parser configuration -> embedding look-up table -> hidden layers -> softmax
use neural networks to work out which parsing action should be taken to successfully parse the sentence
Dependency Structures vs. Phrase Structures
Dependency structures explicitly represent
Head-dependent relations (directed arcs)
Functional categories (arc labels)
predicate argument structure
Dependency structure independent of word order
Suitable for free word order languages, such as Indian languages
Phrase structures explicitly represent
Phrases (non-terminal nodes)
Structural categories (non-terminal labels)
Fragments are directly interpretable
Syntactic structure consists of lexical items, linked by binary asymmetric relations called dependencies
head->dependent
head (governor) : grammatically most important
dependent (modifier): modifier, object, or complement
With or without labels (types of dependencies, eg. modifier, argument)
NLP in Practice_1
Structure in a document/sentence/phrase
Model documents with different structures:
bag-of-words
sequence: we speak in sequences
tree
graph
Each structure gives a representation:
simple structures, such as bag-of-words, are efficient for search, retrieval, and text classification
complex structures such as trees are needed to model syntax, semantics, and logic, among others
Target Tasks
NLP tasks require a particular computational approach:
classification
document relevance
sequence labeling
structure prediction
ranking
Tokenisation
Divide text into words, numbers, punctuation, and other symbols
for alphabetic languages such as English, French, Korean
words are delimited by spaces
deterministic methods are sufficiently accurate
Logographic writing systems such as Mandarin, Japanese
no spaces between words
probabilistic methods are more accurate
Mandarin Word Segmentation:
a sequence tagging computational problem
LSTM-CRF architecture
Each character is tagged with one of 4 classes: beginning, middle of a multi-character word, end, or single-character word
Task: part of speech tagging
part of speech (POS) refers to the syntactic role of each token in a sentence
POS-tagging is the task of assigning POS-Tags to tokens, which are context dependent
Tag based on context
POS-tagging
Sequence tagging computational problem
LSTM architecture
Three types of embeddings (an embedding is a process used in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers):
fine-tuned embeddings (updated during training)
pre-trained embeddings (never updated)
character embeddings
average accuracy = 96.5 %
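For comparison, a minimal sketch of POS-tagging with NLTK's off-the-shelf tagger (a simpler model than the LSTM described above); the example sentence is invented and the resources are assumed to be downloadable.
```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I went for a run")
print(nltk.pos_tag(tokens))   # e.g. [('I', 'PRP'), ('went', 'VBD'), ..., ('run', 'NN')]
```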
Syntax parsing
Syntax parsing analyses the grammatical arrangement of words in a sentence and their relationships with each other.
There are two major theories to describe syntax:
Phrase structure grammar → constituency trees
Dependency grammar → dependency trees
Dependency parsing algorithms:
Transition-based parsing (previous lecture)
Graph-based parsing
Transition-based parsing:
known as shift-reduce parsers: they parse sentences from left to right, maintaining a "buffer" of not-yet-parsed words and a "stack" of words whose head has not been seen or whose dependents have not all been fully parsed.
train any multi-class machine learning classifier on features extracted from the stack, buffer, and previous arc actions in order to predict the next action.
Graph-based parsing:
takes into account all the possible trees.
It's slower, running in cubic time.
use machine learning to assign a probability to each possible edge and then construct a maximum spanning tree (MST) from these weighted edges. The tree with the highest total probability is ranked as the most accurate one.
Semantic Parsing
Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning.
Parsing method to turn sentence into a logical form.
Architecture
generative encoder-decoder
Algorithms
sequence-to-sequence
sequence-to-tree
Challenges:
How to encode syntax
How to decode a hierarchical structure
Sequence-to-tree
A two step decoder:
one for the structure
one for the words
Input -> sequence encoder -> Attention layer -> Sequence/tree decoder -> logical form.
Semantic parsing application:
Question answering
Logic representations
Chatbots
Automation
Semantic parsing data sets
Text to SQL (transfer text into SQL format)
Advising: questions about university course advising
Scholar: questions about academic publications
Restaurants: questions about restaurants
Semantic parsing is hard
Domain dependent
representation dependent
Training a general-purpose semantic parser is not possible.
Possible solution:
transfer learning (Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.)
multi-task learning
data augmentation
Discourse Parsing
Discourse parsing is the task of identifying the relatedness and the particular discourse relations among various discourse units in a text
Discourse parsing is the task of identifying the rhetorical structure of documents
or sentences (a tree or a graph);
It implies organizing text by means of relations that hold between parts of text.
Tasks:
1) Identify discourse units
2) Link discourse units with discourse relations
discourse relations is a finite set: cause, background, contrast, purpose, etc
Discourse parsing can be inter-sentential (between document sections, paragraphs,
sentences) or intra-sentential (between phrases in a sentence)
There are several discourse theories, the most popular in NLP is the Rhetorical
Structure Theory
Algorithms for discourse parsing
a) Bottom-up discourse parsing: compositional vector grammars
b) Transitional-based discourse parsing: shift-reduce algorithm as in syntax
parsing
c) Segmenting intra-sentential discourse units
c1) Train a classifier to determine whether a syntactic constituent is a discourse unit
c2) Sequence labeling model with BIO encoding, e.g., CRF algorithm
Applications of discourse parsing
Extractive summarization
Discourse classification
Text Coherence:
in text generation
machine translation
summarization
Discourse parsing is hard → requires the analysis of phrases and
sentences and other units semantically
not much annotated data for discourse
parsing, thus supervised learning is not always possible
Rule-based systems are still very popular in this area. They facilitate the automatic and semi-automatic annotation of corpora that can be used later on for training
Named Entity Classification
NER is the task of finding and classifying named entities in sentences; each name is tagged with its type based on its location and context.
a sequence labeling computational problem
LSTM-CRF architectures are the best performing ones
Training data uses BIO (Beginning, Inside, Other) encoding: a more descriptive tag schema (see the example below)
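A minimal illustration of BIO-encoded NER training data; the sentence and entity types are made up.
```python
# B- marks the beginning of an entity, I- its continuation, O any non-entity token
sentence = ["Ada", "Lovelace", "was", "born", "in", "London"]
bio_tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(list(zip(sentence, bio_tags)))
```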
Challenges: ambiguity problems
city vs person
person vs month
date vs quantity
person vs organisation
multi-language NER:
Language independent features
lack of, or too much, capitalisation
free word order languages
languages with rich morphology
NER Applications
For more complex NLP tasks, such as question answering, text summarization, or machine translation:
Faster Search Algorithms
Customer support
Content recommendation
Help other NLP tasks
Syntax parsing: by grouping named entities as single units
Relation extraction: relations usually happen between entities
Machine translation: many named entities should not be translated
Summary
Different NLP tasks require a particular computational approach.
Some NLP tasks enrich the knowledge in text to allow language formalization and understanding:
Tokenization, POS-Tagging, Syntax Parsing, NER
Other tasks are closer to the final applications:
Sentiment analysis, summarization
NLP in Practice_2
Task: Relation extraction:
Relationship extraction is the task of extracting semantic relationships from a text. Extracted relationships usually occur between two or more entities of a certain type (e.g. Person, Organisation, Location) and fall into a number of semantic categories (e.g. married to, employed by, lives in).
How relations are expressed in natural language
Relations are instantiated by predicates
Predicates have arguments
Verbs are the most productive predicate form
Why relation extraction
Create new structured knowledge
Augment current knowledge bases
Adding words to the WordNet thesaurus
find semantic relationships between pairs of mentions of entities.
RE is essential for many downstream tasks such as knowledge base completion and question answering.
support question answering
Which relations to extract and how
A pre-defined set of relations
Open relation extraction
Extracting a pre-defined set of relations
Sequence labeling problem
Classification algorithm
Supervised relation extraction between entities
Find all pairs of named entities (person, location, organisation, etc.)
Decide if 2 entities are related
If yes, classify the relation into relation types
Can use different classifier:
Naive Bayes, CRF, SVM, CNN, etc.
Open Relation Extraction: all relations
Extract relations and their arguments from syntax trees
Lots of unsupervised approaches
New trend: treat it as a sequence labeling problem
Unsupervised relation extraction:
Preprocessing -> entity features/sentence features -> PCA -> HAC clustering
Task:Sentiment analysis
Also known as opinion polarity detection; it is a text classification problem.
Who uses it:
Marketers
Social scientists
Digital humanities scientists
Health experts
Text is classified into three classes:
positive
negative
neutral
Long documents: a bag-of-words model works well.
Short documents: bag of bigrams
Treating negation is important
But a sentence can also contain several opposite sentiments.
GloVe embeddings model from AllenNLP
RoBERTa large model from AllenNLP
These models can give opposite sentiment results for the same text
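A minimal sketch of sentiment analysis as bag-of-words (plus bigram) text classification with scikit-learn; the tiny training set and its labels are invented for illustration.
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great film, loved it", "terrible plot and bad acting",
               "wonderful performance", "boring and not worth watching"]
train_labels = ["positive", "negative", "positive", "negative"]

# Unigrams + bigrams, so short negated phrases like "not worth" become features
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(train_texts, train_labels)
print(model.predict(["not a bad film at all"]))   # negation makes this hard for bag-of-words
```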
Task: Word sense disambiguation
Word sense disambiguation is the task of identifying the intended sense of each word token in a sentence
It requires first identifying the correct part-of-speech and lemma for each token to begin with
Multi-class classification:
classify a given lemma into a pre-defined set of possible senses
usually senses are taken from WordNet
modeling the context of the lemma is crucial
Unsupervised approaches:
e.g., Transforming word embeddings into sense embeddings using graph clustering
Not trivial, as the same word in different contexts can be predicted incorrectly
Hints to help with WSD
One sense per discourse hypothesis: knowing the genre of the document/sentence helps (the context is useful)
e.g., is a given sentence from a document about ecology or business?
Document topic classification is easier than WSD, so it is helpful to run topic classification first.
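A minimal sketch of a classical unsupervised WSD baseline, the Lesk algorithm as implemented in NLTK (not the supervised classifier described above); it picks the WordNet sense whose gloss overlaps most with the context and may not always choose the intuitive sense.
```python
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)

context = nltk.word_tokenize("I went to the bank to deposit my money")
sense = lesk(context, "bank", pos="n")     # choose a noun sense of "bank" given the context
print(sense, "-", sense.definition())
```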
Task: Summarisation
Automatic summarization is a classical NLP problem with more than 60 years
of history and still a HOT topic
Many possible computational approaches: sequence labeling, classification, ranking, rule-based e.g., syntax tree pruning
Extractive summarization
Information Extraction approach:
copy the most important information to the summary (e.g.: key phrases, clauses, sentences, paragraphs, etc.)
Algorithms
✓ Heuristic-based
✓ Classification problem
✓ Ranking
Steps
1) Sentence ranking
2) Sentence selection
3) Sentence reformulation (in novel methods)
4) Sentence ordering
Step 1: Relevance methods to assess which sentences are the most important
Input: sentences
Output: sentences are ranked according to their relevance
Common relevance methods:
keywords
position
titles
indicative phrases
Step 2: Sentence selection as classification
Each sentence is described by a set of features (as in the previous step)
Two classes: extract | do-not-extract (binary classification process)
Algorithms:
Regression models for importance prediction
Learning to rank models that assign high ranks to important sentences
Sequence labeling models: model inter-sentence dependency
The problem of redundancy
Multi-document extractive summarization faces a problem of potential redundancy
→ extract sentences that are both "central" (i.e., contain the main ideas)
→ and "diverse" (i.e., they differ from one another)
The Maximal Marginal Relevance (MMR) algorithm is used to model redundancy
sim_1 = relevance to the query
sim_2 = novelty of the information (dissimilarity to already selected sentences)
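A minimal sketch of greedy MMR selection; the interface (a relevance dict for sim_1 and a pairwise similarity function for sim_2) and the example scores are assumptions.
```python
def mmr_select(candidates, relevance, similarity, k=3, lam=0.7):
    """Greedily pick k candidates balancing query relevance against redundancy."""
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((similarity(i, j) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

rel = {0: 0.9, 1: 0.8, 2: 0.3}
sim = lambda i, j: 0.95 if {i, j} == {0, 1} else 0.1   # sentences 0 and 1 are near-duplicates
print(mmr_select([0, 1, 2], rel, sim, k=2, lam=0.5))   # picks 0, then 2 instead of redundant 1
```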
Unsupervised extractive summarisation
category description
documents in category
algorithm for unsupervised keyphrase extraction
keyphrases with score
re-ranked results using MMR
Abstractive summarization
Text generation approach: generate
entirely new phrases and sentences to capture the meaning of the source document
Involves re-writing sentences
Sentence paraphrasing or
Sentence simplification or
Sentence compression or
or/and generation of novel content
From graphs or tables to sentences
1) Content determination (what information?)
2) Text/Doc structuring (ordering)
3) Sentence aggregation (merging sents. = readability, naturalness)
4) Lexicalization (from concepts to words)
5) Referring expressions generation (pronouns, anaphora)
6) Realization (according to syntax and morphology)
From sentences to sentences:
syntax-based heuristics
sequence-to-sequence models
Summary
Some NLP tasks enrich the text to allow language formalization and understanding
POS-Tagging, syntax parsing, NER
Other tasks are related to final applications
Sentiment analysis, summarization
NLP in Practice_4 NLP Evaluation
Intrinsic evaluation
Directly test a task's correctness using a gold standard:
Evaluate a POS-Tagger
Calculate the match between predicted and gold-standard POS-tags
Extrinsic evaluation
Test whether the output is useful for downstream tasks
Evaluate a summarization technique in an IR problem
Does having summaries (instead of complete docs) help a particular retrieval task?
A gold standard is a set of correct answers/annotations/tags, etc., which is supposed to be representative of a problem.
A baseline is something to compare to; sometimes it's something we want to beat, e.g., a similar method or a simpler approach
Classifiers Evaluation
Evaluate using a test set (or gold standard)
Do not evaluate/test in your training set
Cross-validation is an option when there is no test set
Classification metrics
Accuracy: the number of correct predictions, divided by the total number of instances
Not suitable for class-imbalanced datasets (most data sets are imbalanced)
Precision, recall, and F-measure (same as in IR)
There are two possible errors:
False positive: the system incorrectly predicts the label
False negative: the system incorrectly fails to predict the label
There are two ways to be correct:
True positive: the system correctly predicts the label
True negative: the system correctly predicts that the label does not apply to this
instance
Evaluating multi-class classification
Macro F-measure: when there are multiple labels of interest (e.g., in word sense disambiguation or emotion classification), it is necessary to combine the F-measure across the classes
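A minimal sketch of accuracy and macro-averaged precision, recall and F-measure with scikit-learn; the gold labels and predictions are made up.
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

gold = ["pos", "pos", "neg", "neg", "neu", "pos"]
pred = ["pos", "neg", "neg", "neg", "neu", "pos"]

print("accuracy :", accuracy_score(gold, pred))
print("precision:", precision_score(gold, pred, average="macro"))   # averaged over classes
print("recall   :", recall_score(gold, pred, average="macro"))
print("macro F  :", f1_score(gold, pred, average="macro"))
```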
Threshold free metric: ROC-AUC
AUC: Area Under The Curve
ROC: Receiver Operating Characteristics
For binary classification, it shows the trade-off between the true positive rate and the false positive rate across decision thresholds
AUROC of 0.5 (area under the red dashed line) corresponds to a coin flip, i.e. a useless model
AUROC less than 0.7 is sub-optimal performance
AUROC of 0.70 – 0.80 is good performance
AUROC greater than 0.8 is excellent performance
AUROC of 1.0 (area under the purple line) corresponds to a perfect
classifier
BLEU (bilingual evaluation understudy): a modified form of precision to compare a
candidate translation against multiple reference translations
Perfect match = 1.0
Perfect mismatch = 0.0
pn = (number of n-grams appearing in both the reference and the hypothesis translation) / (number of n-grams in the hypothesis translation)
Brevity penalty (BP): precision-based metrics are biased in favor of short translations, so BP penalises hypotheses that are shorter than the reference
(Figure: a reference translation and three system outputs; for each output, pn indicates the precision at each n-gram, and BP indicates the brevity penalty)
Term overlap evaluation metrics
Machine translation
Text summarization
Text simplification
Q&A
Chatbots
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): calculates the recall between human and automatic outputs in terms of n-grams (n-gram overlap)
ROUGE-N: overlap of n-grams between the system output and a gold-standard reference
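A minimal sketch of ROUGE-N as n-gram recall between a system output and a single reference; the sentences are invented and real ROUGE implementations add stemming and multi-reference handling.
```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(system, reference, n=1):
    # recall: fraction of the reference's n-grams that also occur in the system output
    sys_counts, ref_counts = ngrams(system, n), ngrams(reference, n)
    overlap = sum(min(count, sys_counts[gram]) for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "the cat sat on the mat".split()
system = "the cat was on the mat".split()
print(rouge_n(system, reference, n=1), rouge_n(system, reference, n=2))   # 5/6 and 3/5
```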
Evaluation examples:
Semantic parsing
Dependency parsing
named entity recognition
Ranking
Term overlap metrics:
machine translation
text generation
text simplification
Evaluation Methodology
Always split in 3:
training set
validation set
test set
never test with training set
with a small dataset, use the cross-validation technique:
Divide original data into k parts
Classifier comparison
Comparison is between algorithms
e.g., logistic regression vs. Perceptron,
L2 regularization vs. L1
Comparison is between feature sets
e.g., bag-of-words vs. word embeddings
word embeddings vs. character embeddings
Ablation testing involves systematically removing (ablating) various aspects of a classifier, such as feature groups, and testing the null hypothesis that the ablated classifier is as good as the full model
Evaluation can be tricky, e.g., in automatic summarization
Intrinsic evaluation:
Humans read the documents and decide which are the most relevant sentences
ROUGE measure: calculate the recall between human and automatic summaries in terms of n-grams (n-gram overlap)
Extrinsic evaluation:
Verify that the summaries are useful for a specific task, e.g. text classification
Issues regarding summarisation evaluation:
humans usually do not agree on which are the most important sentences;
there is more than one possible summary; human-generated summaries are costly
Summary
Meaningful evaluations are essential for measuring success
NLP methods are mostly intrinsically evaluated against a gold standard
... gold standards are always biased
Comparison of methods is not trivial
Statistical significance testing is a fundamental tool in method comparison
:star:Why Nivre's Parsing?
Syntactic structure of a sentence is described solely in terms of the words (or lemmas) in a sentence.
and an associated set of directed binary grammatical relations that hold among the words
Involves:
LEFTARC
RIGHTARC
SHIFT
REDUCE: Remove from the stack
:star: Why ADI and KNS?
:star: Why semantic?
Understanding of the logic and meaning of the sentence or text.
NLP in practice 4: Multi-lingual NLP
Multi-lingual word embeddings
The alignment between monolingual embeddings; there are several methods:
Supervised
using a training bilingual dictionary, learn a mapping from the source to the target space using (iterative) Procrustes alignment
Unsupervised
without any parallel data, learn a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement
MUSE(Multilingual Unsupervised and Supervised Embeddings)
A Python library that uses fastText embeddings, with large-scale bilingual dictionaries for training and evaluation.
Supervised word embeddings for 30 languages, aligned in a single vector space from fastText
Ground truth bilingual dictionaries from fastText
Similar geometric relations can be used to put languages in correspondence (e.g. English and Spanish)
These geometric relations do not need dictionaries and can be exploited by the unsupervised model, which finds the spatial similarity between the two languages' embedding spaces.
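A minimal sketch of the orthogonal Procrustes step used in such alignments: given paired source/target vectors X and Y from a seed dictionary, the rotation W minimising ||WX - Y|| is U Vᵀ from the SVD of Y Xᵀ. The toy data (the target space is just a random rotation of the source space) is an assumption for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))                 # source-language vectors (dim x dictionary size)
true_rotation, _ = np.linalg.qr(rng.normal(size=(50, 50)))
Y = true_rotation @ X                          # toy target vectors: a rotated copy of X

U, _, Vt = np.linalg.svd(Y @ X.T)
W = U @ Vt                                     # optimal orthogonal map from source to target
print(np.allclose(W @ X, Y))                   # True: the rotation is recovered
```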
Summary
Multilingual NLP deals with cross-lingual resources and models
There is a need for the development of multilingual applications, for example commercial applications such as product recommendation. It's possible to build multilingual NLP resources with monolingual or parallel corpora → lots of data
Universal linguistic representations are key to success in multilingual NLP
Low resource NLP
Languages lacking large monolingual or parallel corpora and/or manually crafted resources
NLP models require large amounts of training data and complex language-specific engineering (such engineering is expensive and requires linguistically trained speakers of the language)
Reasons:
Language diversity and preserving languages.
Extending NLP to areas that are less listened to.
In emergency situations, bilingual models can help people who speak different languages.
There are a large number (~6k) of languages that lack NLP resources.
Active learning
Transfer learning
Multi-task learning
Learning to learn and Meta-learning
Semi-supervised learning
Dual learning
Unsupervised learning
Unsupervised learning:
Unsupervised POS-tagging;
Unsupervised dependency parsing
Brown clustering
Other ideas:
language projection
universal representations and interlinguas
Cross Lingual Transfer Learning:
Transfer of annotations
Such as POS tags, syntactic or semantic features via cross lingual bridges (e.g., word or phrase alignments)
Transfer of models (similar to pre-trained model):
Training a model in a resource rich language and applying it in a resource poor language in zero shot or one shot learning
Transfer other parameters: features
Joint Multilingual or “Polyglot” Learning
Resource rich and resource poor learning using a language universal representation
Convert data in all languages to a shared representation (e.g., multilingual word vectors)
Train a single model on a mix of datasets in all languages, to enable parameter sharing where possible
Summary
We have reviewed some of the fundamental NLP tasks that allow access to meaning and text understanding, so that it's possible to build NLP applications
Machine learning, especially deep learning, is the favorite current approach to NLP, but not the only one or the most suitable one. It depends on the task and data availability
Meaningful evaluations are essential for measuring success
There are more than 7000 languages in the world today
A few dominate → lots of resources available for NLP
Small languages → low resource NLP
Multi-lingual NLP problems
Intra-word code switching (even harder!)
As a sequence labeling problem at the character level
Tags with BIO encoding
As a text classification problem
Multi-lingual resources
Universal Dependencies: POS Tags, morphological features, and syntactic dependencies across 70 languages
:star:What are some of the NLP tasks?
:star:why pos-tagging
A POS tag is a tag that indicates the part of speech for a word.
POS tags have been used for a variety of NLP tasks and are extremely useful since they provide linguistic signal on how a word is being used within the scope of a phrase, sentence, or document.
What I mean by this is that the word “run” can be used as a verb “I run 5 miles every day” or as a noun “I went for a run”.
Sometimes the POS is very very useful in cases where it distinguishes the word sense (the meaning of the word).
In other cases, it is still useful in explaining the syntactic role of a word, and we can often infer semantic information from this due to our knowledge of how this syntactic role is commonly used semantically.
Edges represent dependencies
Analyses the grammatical structure of sentences, by establishing relationships between “head” words and “modifier” words, which change them