Natural Language Processing
Accuracy
True Positive (TP) - correctly collected data
True Negative (TN) - correctly ignored data
False Positive (FP) - incorrectly collected data
False Negative (FN) - incorrectly ignored data
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 x (Precision x Recall) / (Precision + Recall)
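A minimal sketch of the three metrics in Python, computed straight from raw counts; the example counts (8 TP, 2 FP, 1 FN) are made up for illustration:
```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * (p * r) / (p + r)

# Example: 8 correct hits, 2 false alarms, 1 miss.
print(precision(8, 2))  # 0.8
print(recall(8, 1))     # ~0.889
print(f1(8, 2, 1))      # ~0.842
```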
Stem/Lemm
Stemming - Tries to chop the suffix off a word. The generated word isn't always a real word, but it is still useful
Lemmatization - Smarter than stemming: looks at a word and tries to produce its lemma (dictionary form). More accurate when combined with PoS tagging
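A short sketch of both using NLTK, assuming nltk is installed and the WordNet data has been fetched (nltk.download('wordnet')):
```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops suffixes; the output isn't always a real word.
print(stemmer.stem("studies"))   # 'studi'
print(stemmer.stem("inflated"))  # 'inflat'

# Lemmatization maps to the lemma; a PoS hint improves accuracy.
print(lemmatizer.lemmatize("studies", pos="n"))  # 'study'
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
```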
FSA
Finite State Automaton - A linguistic diagram which describes a grammar: states joined by labelled transitions, with a start state and one or more accepting states. Can also be written in a text format
Can be used to recognise a regular expression
Limited in what they can describe (regular languages only)
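A minimal sketch of an FSA as a Python transition table, recognising the classic sheep language /baa+!/; the state numbering here is an illustrative assumption:
```python
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # loop: two or more a's in total
    (3, "!"): 4,
}
START, ACCEPTING = 0, {4}

def accepts(string):
    # Walk the transition table; reject on any undefined transition.
    state = START
    for ch in string:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

print(accepts("baaaa!"))  # True
print(accepts("ba!"))     # False (needs at least two a's)
```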
Morphology
Study of how words are formed from smaller units
Stem - Main part of the word. Eg. inflate
Suffix - Morpheme added to the end of a word. Eg. inflated
Prefix - Morpheme added to the start of a word. Eg. de-inflate
Inflection - stem + morpheme -> word of the same class. Eg. stone -> stones
Derivation - stem + morpheme -> word of a different class. Eg. hammer (noun) -> hammering (verb)
Morpheme - The smallest unit of a word that carries meaning
N-Grams
Uni-gram - looks at no previous words; predicts the next word from its frequency alone, based on a corpus
Bi-gram - looks at the previous 1 word
Tri-gram - looks at the previous 2 words
These can be chained together (backoff): look for the 3-word sequence first; if it isn't found, fall back to the 2-word sequence, and so on
Corpora are often split 75-80% training, 20-25% testing
Laplace Smoothing - Deals with zero probabilities: P(w) = count/N becomes P(w) = (count+1)/(N+V), where V is the vocabulary size (see the sketch after this list)
Good-Turing - Uses the count of things seen once to estimate the probability of unseen things. Eg. P(unseen) = (number of things seen once) / (total number of things)
Used to predict the probability of the next word, given a sequence of words
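A minimal bigram sketch with Laplace (add-one) smoothing in Python; the toy corpus is an illustrative assumption:
```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_laplace(word, prev):
    # P(word | prev) = (count(prev word) + 1) / (count(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

print(p_laplace("cat", "the"))  # seen bigram: ~0.333
print(p_laplace("dog", "the"))  # unseen: small but non-zero, ~0.111
```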
Words
Open class - new words are invented all the time. Eg. Noun, Verb, Adjective
Closed class - a fixed set of words. Eg. prepositions, determiners, auxiliaries
Proper Nouns - Names of people or places. Eg. Adam, Bristol
Proper Names - multi-word extensions of proper nouns. Eg. Peter the Great
Part of Speech tagger - Goes through text and tags each word based on a tagset.
Brill Tagger - Takes an untagged corpus and a correctly tagged corpus. First tags the untagged corpus with each word's most common PoS tag, then learns a set of transformation rules which gradually move it towards the correct tagging
A good industry-level tagger reaches about 96% accuracy; average human accuracy is 96-97%
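A quick sketch using NLTK's off-the-shelf tagger (assumes nltk is installed along with its tokenizer and tagger data packages):
```python
import nltk
# May need: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("I can open a can of beans")
print(nltk.pos_tag(tokens))
# [('I', 'PRP'), ('can', 'MD'), ('open', 'VB'), ('a', 'DT'),
#  ('can', 'NN'), ('of', 'IN'), ('beans', 'NNS')]
# Note the two 'can's get different tags: modal verb vs. noun.
```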
Hidden Markov Model
HMM - Given a word, uses Bayes' rule to determine the most likely interpretation. Eg. I "can" do that (modal verb) / A "can" of beans (noun)
Markov model - A diagram which uses the statistics of the language to generate sentences. Each arc is weighted with the probability that the next word will be of that type
The most likely tag sequence can be found with the Viterbi algorithm (see the sketch below)
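A compact Viterbi sketch over a toy two-tag HMM in Python; all probabilities are made-up illustrative values, not trained ones:
```python
states = ["MD", "NN"]  # modal verb vs. noun, for the "can" example
start = {"MD": 0.5, "NN": 0.5}
trans = {("MD", "MD"): 0.1, ("MD", "NN"): 0.9,
         ("NN", "MD"): 0.4, ("NN", "NN"): 0.6}
emit = {("MD", "can"): 0.8, ("NN", "can"): 0.2,
        ("MD", "beans"): 0.01, ("NN", "beans"): 0.99}

def viterbi(words):
    # v[t][s] = probability of the best tag path ending in s at time t
    v = [{s: start[s] * emit.get((s, words[0]), 1e-6) for s in states}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] * trans[(p, s)])
            col[s] = v[-1][prev] * trans[(prev, s)] * emit.get((s, w), 1e-6)
            ptr[s] = prev
        v.append(col)
        back.append(ptr)
    # Trace the best path back from the final column.
    best = max(states, key=lambda s: v[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["can", "beans"]))  # ['MD', 'NN']
```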
Grammar
Phrases - Groups of words which combine to describe a single thing. Eg. "angry men" - a noun phrase
Context free grammar - eg. S -> NP VP ; NP -> noun noun etc..
Parsing - Using a grammar to determine whether a sentence is valid and to assign it a structure (parse tree)
Features - Extra notation in CFGs that adds arguments to each word/phrase. Deals with tense and agreement
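A small sketch of CFG parsing with NLTK (assumes nltk is installed); the toy grammar is an assumption built around the "angry men" example:
```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det Adj N | Det N
VP -> V NP
Det -> 'the'
Adj -> 'angry'
N -> 'men' | 'dog'
V -> 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the angry men saw the dog".split()):
    print(tree)
# (S (NP (Det the) (Adj angry) (N men))
#    (VP (V saw) (NP (Det the) (N dog))))
```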
recap 22nd oct