Natural Language Processing

Represent meaning

One hot encoding

Distributional Semantics

Huge, sparse vectors (one dimension per vocabulary word)

No natural notion of similarity (all one-hot vectors are orthogonal)
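
A tiny sketch of those two drawbacks: one-hot dimensionality equals the vocabulary size, and any two distinct words have dot product 0, so there is no notion of similarity (the vocabulary here is a made-up toy).

```python
import numpy as np

vocab = ["hotel", "motel", "cat"]            # toy vocabulary; real ones have 10^5+ words
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["hotel"])                      # [1. 0. 0.] -> dimension grows with the vocabulary
print(one_hot["hotel"] @ one_hot["motel"])   # 0.0 -> orthogonal, no notion of similarity
```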

WordNet

Synonym sets

Hypernyms ("is a" relation)
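
A small sketch of querying WordNet through NLTK (assuming nltk is installed; the WordNet data is downloaded on first use): it lists a few synonym sets for one word and the hypernyms of one synset.

```python
import nltk
nltk.download("wordnet", quiet=True)      # fetch the WordNet data once
from nltk.corpus import wordnet as wn

for syn in wn.synsets("good")[:3]:        # synonym sets ("synsets") containing "good"
    print(syn.name(), syn.lemma_names())

dog = wn.synsets("dog")[0]
print(dog.hypernyms())                    # "is a" relations, e.g. canine / domestic_animal
```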

A word's meaning is given by the words that frequently appear close-by

Word Embedding (Word Vector)

Build a dense vector for each word, so that words appearing in similar contexts get similar vectors

Word2Vec: a framework for learning word vectors

For each position t: predict context words within a window of fixed size m,
given center word w_t

Skip-gram

Train with Negative Sampling: train binary logistic regression (sigmoid)

True pair: center + word in context

Noise pair: center + random word

Maximize the probability of true pairs co-occurring and
minimize the probability of noise pairs

P(w_{t+i} | w_t; theta)

P(context | center) = softmax(u_context . v_center)
= exp(u_context . v_center) / sum over all words w of exp(u_w . v_center)
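
A tiny numpy sketch of that softmax, with random vectors standing in for the learned u and v (the vocabulary size and dimension are toy values); note that the normalization runs over every word in the vocabulary.

```python
import numpy as np

V, d = 5, 4                        # toy vocabulary size and embedding dimension
U = np.random.randn(V, d)          # u vector of every word (used when it is a context word)
v_center = np.random.randn(d)      # v vector of the current center word

scores = U @ v_center              # u_w . v_center for every word w
probs = np.exp(scores) / np.exp(scores).sum()   # P(w | center) for every candidate context word
print(probs, probs.sum())          # a distribution over the whole vocabulary (sums to 1)
```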

Simple but expensive to train (the softmax normalizes over the entire vocabulary)

2 vectors for each word

v: used when it is the center word

u: used when it is a context word

theta: all the u and v vectors of every word, stacked into one parameter vector
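
A minimal sketch of the negative-sampling loss for one (center, context) pair, using the u/v convention above; the vectors and the K noise words are random placeholders, and a real implementation samples noise words from a unigram^(3/4) distribution and sums this loss over the whole corpus.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, K = 4, 3                       # embedding dimension, noise words per true pair
v_c = np.random.randn(d)          # center word vector
u_o = np.random.randn(d)          # true context word vector
u_noise = np.random.randn(K, d)   # K randomly sampled noise word vectors

# Maximize log sigmoid(u_o . v_c) for the true pair,
# push down the noise pairs via log sigmoid(-u_k . v_c):
loss = -np.log(sigmoid(u_o @ v_c)) - np.log(sigmoid(-(u_noise @ v_c))).sum()
print(loss)                       # minimized with SGD during training
```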

Count-Based

Build co-occurrence matrix

Use a window around each word to capture some syntactic and semantic information

Vector size increases with the vocabulary

High dimensionality and sparsity -> less robust models

Need improvements

Singular Value Decomposition (SVD) to reduce dimensionality

Scale counts

log(freq)

min(count, t=100)

Ignore function words

Use Pearson correlations instead of counts
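
A small sketch of this count-based route on a toy corpus: build a windowed co-occurrence matrix, log-scale the counts, and apply SVD to keep only the top k dimensions as dense word vectors (corpus, window size, and k are toy values).

```python
import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1   # count co-occurrences inside the window

X = np.log1p(X)                   # scale counts: log(1 + freq)
U, S, Vt = np.linalg.svd(X)       # SVD of the co-occurrence matrix
k = 2
word_vectors = U[:, :k] * S[:k]   # keep the top-k singular dimensions
print(len(vocab), word_vectors.shape)
```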

GloVe

Ratio of co-occurrence probabilities
can encode meaning components

Log-bilinear model: w_i . w_j = log P(i | j)
with vector differences: w_x . (w_a - w_b) = log( P(x | a) / P(x | b) )
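
In practice these vectors come from GloVe's weighted least-squares objective over the co-occurrence counts X; below is a minimal sketch of evaluating that loss with random placeholder parameters (toy sizes; x_max = 100 and alpha = 0.75 are the usual weighting constants).

```python
import numpy as np

V, d = 6, 5
X = np.random.randint(0, 50, size=(V, V)).astype(float)    # toy co-occurrence counts
W, W_tilde = np.random.randn(V, d), np.random.randn(V, d)   # word and context vectors
b, b_tilde = np.random.randn(V), np.random.randn(V)         # bias terms

def f(x, x_max=100.0, alpha=0.75):
    return np.minimum((x / x_max) ** alpha, 1.0)             # down-weights very frequent pairs

mask = X > 0                                                 # only nonzero co-occurrences count
diff = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - np.log(np.where(mask, X, 1.0))
loss = (f(X) * mask * diff ** 2).sum()   # J = sum_ij f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2
print(loss)                              # minimized during GloVe training
```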

Advantages

Good performance even with a small corpus and small vectors

Fast training

Scalable to huge corpora

Dependency Structure:
which words depend on which other words
(i.e. modify, attach to, or are arguments of other words)

Context-Free Grammar

Given a sequence of words,
we need to figure out the structure behind it -> get the meaning

Syntactic structure consists of relations between lexical items
-> these relations form a tree
-> a corpus of such trees is a Treebank

Dependency Parsing

Parse a sentence by choosing for each word what other word it is a dependent of
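
For a concrete picture of such a parse, here is a quick sketch using spaCy (assuming spaCy and its small English pipeline en_core_web_sm are installed); it prints each word, its dependency relation, and the head word it depends on.

```python
import spacy

nlp = spacy.load("en_core_web_sm")             # pretrained pipeline with a dependency parser
doc = nlp("I saw the girl with a telescope")

for token in doc:
    # every word is a dependent of exactly one head word (the root points to itself)
    print(f"{token.text:10s} --{token.dep_:>6s}--> {token.head.text}")
```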

Dynamic Programming

Graph Algorithm

Constraint Satisfaction

Transition-based parsing

Greedy Transition-based parsing

Shift / Left-Arc / Right-Arc actions over a Stack + Buffer

Train a classifier to predict the next action

Linear time
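
A minimal sketch of that stack/buffer transition system (arc-standard style). `choose_action` here is a static oracle reading off toy gold heads, just to drive the loop; in an actual parser it is the trained classifier predicting the next action from stack/buffer features. The sentence and its heads are made-up illustration values.

```python
def choose_action(stack, buffer, gold_heads):
    """Static oracle: picks the action a trained classifier would learn to predict."""
    if len(stack) >= 2:
        s2, s1 = stack[-2], stack[-1]
        if s2 != 0 and gold_heads[s2] == s1:
            return "LEFT-ARC"
        if gold_heads[s1] == s2 and all(gold_heads[w] != s1 for w in buffer):
            return "RIGHT-ARC"                  # s1 has already collected all its dependents
    return "SHIFT"

def parse(n_words, gold_heads):
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), []   # word 0 is ROOT
    while buffer or len(stack) > 1:
        action = choose_action(stack, buffer, gold_heads)
        if action == "SHIFT":                   # move the next buffer word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":              # top of stack is the head of the word below it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        else:                                   # RIGHT-ARC: second item is the head of the top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs                                 # list of (head, dependent) pairs

# Toy sentence "I ate fish": I -> ate, ate -> ROOT, fish -> ate
print(parse(3, {1: 2, 2: 0, 3: 2}))             # [(2, 1), (2, 3), (0, 2)]
```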

Softmax gives only a linear separation -> not strong enough

Use a neural classifier: more complex, non-linear decision boundaries

Neural Dependency Parser

Compute score for every possible dependency

Repeat for each word -> build a Maximum Spanning Tree
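
A minimal sketch of that graph-based idea: score every possible (head, dependent) pair and pick the highest-scoring head per word. The score matrix is random here, standing in for a neural scorer, and this greedy argmax ignores the tree constraint; a full parser runs a maximum-spanning-tree algorithm (e.g. Chu-Liu/Edmonds) over the same scores.

```python
import numpy as np

words = ["ROOT", "I", "ate", "fish"]
n = len(words)

# scores[h, d]: how good it is for word h to be the head of word d
# (a neural parser computes these from word representations; random here)
scores = np.random.randn(n, n)
np.fill_diagonal(scores, -np.inf)   # a word cannot be its own head
scores[:, 0] = -np.inf              # ROOT has no head

heads = scores.argmax(axis=0)       # for each word, choose its highest-scoring head
for d in range(1, n):
    print(f"{words[d]} <- {words[heads[d]]}")
```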