SNLP
Information Theory
Joint Entropy:
\( H(A, B) = -\sum_{x \in A, y \in B} P(x, y) \log P(x, y) \).
Conditional Entropy:
\( H(Q|P) = H(P, Q) - H(P) = -\sum_{x \in P, y \in Q} p(x, y) \log \frac{p(x, y)}{p(x)} \)
Mutual Information:
\( I(A, B) = H(A) + H(B) - H(A, B) = \sum_{x \in A, y \in B} p(x, y) \log \left( \frac{p(x, y)}{p(x)p(y)} \right) \)
Probability Basics
Perplexity:
- Measure the quality of a language model.
- \( PP = \exp\left( - \sum_{w, h} f(w, h) \ln P(w|h) \right) \).
- PP is simply the exponential of cross-entropy.
- PP can be read as the effective vocabulary size, or branching factor: the smaller the PP, the easier it is to guess the next word (see the sketch after this list).
- Differentiable.
- Alternative: Mean-rank (not differentiable).
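A minimal sketch of the perplexity computation, assuming a toy corpus and a unigram MLE model (both hypothetical); the sum over \((w, h)\) weighted by relative frequencies \(f(w, h)\) is equivalent to averaging log-probabilities over the test tokens:

```python
import math
from collections import Counter

def perplexity(test_tokens, prob):
    """PP = exp(-(1/N) * sum_i ln P(w_i | h_i)); here prob(w) is a unigram model."""
    log_sum = sum(math.log(prob(w)) for w in test_tokens)
    return math.exp(-log_sum / len(test_tokens))

# Hypothetical training data and unigram MLE model.
train = "the cat sat on the mat the cat".split()
counts = Counter(train)
total = sum(counts.values())

print(perplexity("the cat sat".split(), lambda w: counts[w] / total))
```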
Zipf's law:
- The frequency of a word is inversely proportional to its rank.
- This gives a straight line on a log-log plot.
- \( f(r) \propto \frac{1}{r} \).
Mandelbrot distribution:
- \( f(r) = \frac{\mu}{(c+r)^B} \).
where \( \mu \) is the normalizing constant that makes the frequencies sum to 1.
Marginal distribution:
\( \sum_{w_1 \in V} P(w_1, w_2) = P(w_2) \).
Joint probability as a product of conditional probabilities:
\( P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^n P(w_i|w_1, w_2, \ldots, w_{i-1}) \).
Markov assumption: we can truncate the history (e.g. unigram, bigram, etc.)
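- For example, under a bigram (first-order Markov) assumption the chain rule truncates to \( P(w_1, \ldots, w_n) \approx \prod_{i=1}^n P(w_i|w_{i-1}) \), e.g. \( P(\text{the cat sat}) \approx P(\text{the}) \, P(\text{cat}|\text{the}) \, P(\text{sat}|\text{cat}) \).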
Classification
Implicit label: e.g. in Google search, a user clicks a URL and never comes back -> the suggestion was successful.
Hierarchical classification:
- Fewer classes at each layer of the hierarchy.
- We obtain predictions at multiple levels (from abstract to concrete).
Feature selection:
- Eliminate features,
- Weight features,
- Normalize features,
- Transform features (e.g. SVD)
Feature Extraction:
Class independence
Term Frequency:
\( TF(t, d) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}} \)
Inverse Document Frequency:
\( IDF(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \)
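A minimal sketch of both formulas on a toy corpus (hypothetical documents; the IDF denominator is left unsmoothed, matching the formula above):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # TF(t, d) = count of t in d divided by the total token count of d
    counts = Counter(doc_tokens)
    return counts[term] / sum(counts.values())

def idf(term, docs):
    # IDF(t, D) = log(N / |{d in D : t in d}|)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

docs = ["the cat sat".split(), "the dog ran".split(), "cat and dog".split()]
print(tf("cat", docs[0]) * idf("cat", docs))   # TF-IDF weight of "cat" in the first document
```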
Class dependence
Information Gain:
\(\begin{align*} G(t) &= H(C) - H(C|t) \\ &= H(C) + p(t) \sum_i p(c_i|t) \log p(c_i|t) + p(\neg t) \sum_i p(c_i|\neg t) \log p(c_i|\neg t) \end{align*}\)
Pointwise Mutual Information:
\( pmi(t, c) = \log \left( \frac{p(t, c)}{p(t)p(c)} \right) \)
Chi-squared statistic:
\( \chi^2 = \sum_{\text{cells}} \frac{(O-E)^2}{E} \)
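A minimal sketch of the chi-squared computation for a single (term, class) 2x2 contingency table; the observed counts below are made up for illustration:

```python
def chi2(observed):
    """observed[i][j]: doc counts for (term present/absent) x (in class / not in class)."""
    total = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_sums[i] * col_sums[j] / total   # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

# Hypothetical counts: the term appears in 30 of 40 in-class docs and 10 of 60 out-of-class docs.
print(chi2([[30, 10], [10, 50]]))
```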
Algorithms
SVM
Good:
- Robust against over-fitting.
- Robust against high-dimensional data.
- Works well for small datasets.
Bad:
- Usually, data is not linearly separable.
- One outlier may largely affect the boundary.
Decision tree
Good:
- Needs less pre-processing.
- Can embed domain knowledge.
- Interpretable.
Bad:
- Hard decision boundaries.
- Boundaries are step-like (axis-aligned), so smooth or diagonal boundaries are approximated poorly.
- Prone to overfitting.
- Weak at capturing joint dependencies between features.
Sequence Labeling
Hidden Markov Model
\( P(x_1, x_2, \ldots, x_N, \pi_1, \pi_2, \ldots, \pi_N) = \prod_{i=1}^N P(x_i|\pi_i) P(\pi_i|\pi_{i-1}) \)
- The observation \( x_i \) depends only on the label \( \pi_i \), and the label \( \pi_i \) depends only on \( \pi_{i-1} \).
- These are the emission and transition probabilities, respectively.
- Train by EM (Baum-Welch) or gradient descent.
- Inference by Viterbi.
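A minimal Viterbi sketch for an HMM given as dictionaries of start, transition, and emission probabilities; all state names, words, and numbers below are hypothetical:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for obs under an HMM (probabilities, not logs)."""
    # V[t][s] = (probability of the best path ending in state s at time t, backpointer)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            V[t][s] = (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]], prev)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: V[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = V[t][state][1]
        path.append(state)
    return path[::-1]

# Hypothetical toy tagger: two tags, two words.
states = ["N", "V"]
start_p = {"N": 0.6, "V": 0.4}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"fish": 0.7, "sleep": 0.3}, "V": {"fish": 0.4, "sleep": 0.6}}
print(viterbi(["fish", "sleep"], states, start_p, trans_p, emit_p))
```

With m states and n observations this is the \( O(nm^2) \) dynamic program also mentioned for CRFs below; in practice log-probabilities are used to avoid underflow.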
Conditional Random Field
Linear CRF \( Score(y|x) = \sum_j \sum_i \lambda_j f_j(i, y_{i-1}, y_i, x) \).
- Apply softmax function (i.e. exp and normalize) to obtain probability.
- Train with gradient descent.
- Inference with Viterbi in \( O(nm^2) \).
Bayes Network:
- A causal DAG with the Markov assumption: each node depends only on its parents.
Markov Random Field:
- The joint distribution factorizes over the maximal cliques:
\( p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}_M} \Psi_C(x_C) \)
where \( Z = \sum_x \prod_{C \in \mathcal{C}_M} \Psi_C(x_C) \) is the normalizing constant (the numerator summed over all \(x\)).
Language Modelling
Definitions
OOV: out-of-vocabulary rate \( \propto \frac{1}{(\text{vocab size})^\alpha} \).
OOC: Out of context
- Backing-off.
- Use a synonym.
- Word pieces (subword units).
- Unknown character.
Interpolation vs Backing-off:
- Both involve a lower-order LM.
- Backing-off only falls back to the lower-order LM when the higher-order LM cannot be trusted (e.g. it returns 0), while interpolation always mixes in the lower-order LM.
Count-tree:
- Index the history backwards to make backing off easier.
Prune the count-tree:
- Reduce memory.
- Smooths the LM by adding some counts to grams with zero or small counts.
Redundancy:
- Happens when we use more symbols to represent a word than needed.
- Disadvantage: requires longer encoding.
- Advantage: easier correction, or guess the word.
MAP:
- A way to encode domain knowledge into LM.
- We add count \( \alpha_i \) to word \(w_i\).
- Additive smoothing is a simple MAP estimate with a uniform prior (all \(\alpha_i\) equal).
Smoothing
Additive smoothing:
\(P(w|h) = \frac{N(w, h) + \epsilon}{N(h) + \epsilon |V|} \)
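A minimal sketch of additive smoothing for a bigram LM (toy corpus, \(\epsilon = 1\), i.e. Laplace smoothing; all data is made up):

```python
from collections import Counter

tokens = "the cat sat on the mat".split()      # hypothetical training data
vocab = set(tokens)
bigram = Counter(zip(tokens, tokens[1:]))      # N(w, h) with h = previous word
hist = Counter(tokens[:-1])                    # N(h)

def p_add(w, h, eps=1.0):
    # P(w|h) = (N(w, h) + eps) / (N(h) + eps * |V|)
    return (bigram[(h, w)] + eps) / (hist[h] + eps * len(vocab))

print(p_add("cat", "the"))   # seen bigram
print(p_add("mat", "cat"))   # unseen bigram still gets non-zero probability
```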
Absolute discounting:
\( P(w|h) = \max\left(0, \frac{N(w, h) - d}{N(h)}\right) + \lambda(h) \beta(w|\hat{h}) \)
- \( \lambda = \frac{d}{N(h)} N_+(h \bullet) \)
- d is often in [0.7, 0.9]
- \( d \approx \frac{n_1}{n_1 + 2n_2} \), where \( n_i \) is the number of grams that appear exactly \(i\) times.
Linear Interpolation:
\(P(w|h) = (1-\epsilon) \frac{N(w, h)}{N(h)} + \epsilon \beta(w|h) \)
Good-Turing:
\( r^* = (r+1)\frac{n_{r+1}}{n_r} \)
where \(r\) is the current count, \(r^*\) is the adjusted count, and \(n_r\) is the number of grams that occur exactly \(r\) times. Note that the original counts are used for normalization.
\(P(w|h) = \frac{N^*(w, h)}{N(h)} + \lambda (h) \beta(w|h) \)
- To fix the spiky, sparse \(n_r\) at high counts, replace \(n_r\) with \( Z_r = \frac{n_r}{0.5 (t - q)} \), where \(q\) and \(t\) are the nearest lower and higher counts with non-zero \(n\).
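A minimal sketch of the Good-Turing adjusted counts \( r^* \) from count-of-counts \( n_r \) (the gram counts are hypothetical, and no \(Z_r\) smoothing of \(n_r\) is applied):

```python
from collections import Counter

# Hypothetical bigram counts; n[r] = number of distinct grams seen exactly r times.
gram_counts = Counter({"a b": 3, "b c": 1, "c d": 1, "d e": 2, "e f": 1})
n = Counter(gram_counts.values())

def good_turing(r):
    # r* = (r + 1) * n_{r+1} / n_r  (breaks down when n_r or n_{r+1} is 0 -> the spike problem)
    return (r + 1) * n[r + 1] / n[r]

print(good_turing(1))   # adjusted count for grams seen once
print(good_turing(2))   # adjusted count for grams seen twice
```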
Kneser-Ney:
\( \beta(w|\hat{h}) = \frac{N_+(\bullet \hat{h}w)}{N_+(\bullet \hat{h} \bullet)} \)
- An improvement over absolute discounting obtained by a better choice of the backing-off LM (the continuation probability).
- The same as absolute discounting when used for unigram LMs.
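A minimal sketch of a bigram LM with absolute discounting and a Kneser-Ney continuation distribution as the backing-off term (toy corpus, fixed \(d = 0.75\); only meant to illustrate the continuation counts \(N_+\)):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()   # hypothetical training data
bigram = Counter(zip(tokens, tokens[1:]))                # N(h, w)
hist = Counter(tokens[:-1])                              # N(h)
n_types = len(bigram)                                    # number of distinct bigram types

def continuation(w):
    # N_+(. w): in how many distinct contexts h does w appear?
    return sum(1 for (h, w2) in bigram if w2 == w)

def p_kn(w, h, d=0.75):
    discounted = max(bigram[(h, w)] - d, 0) / hist[h]            # absolute discounting
    lam = d * sum(1 for (h2, _) in bigram if h2 == h) / hist[h]  # lambda(h) = d/N(h) * N_+(h .)
    beta = continuation(w) / n_types                             # Kneser-Ney continuation prob.
    return discounted + lam * beta

print(p_kn("cat", "the"))
print(p_kn("ran", "the"))   # unseen bigram: probability comes from the continuation term
```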
Other types of LMs
Class-based LM:
- Words are grouped into clusters; the LM first predicts the cluster of the next word, then the word within that cluster.
\( P(w|h) = P(c|h) \, P(w|c) \)
Neural LM:
Advantages:
- Uses a continuous representation space, which alleviates the curse of dimensionality.
- Model semantic relationships of words as linear combinations.
Disadvantages:
- Computational bottleneck at the output layer (the softmax requires computing exp over the whole vocabulary).
Text compression
Kraft's inequality:
Suppose \(m\) symbols are encoded with code-word lengths \( l_1, l_2, \ldots, l_m \) over an alphabet of size \(D\).
If this encoding is uniquely decodable then \( \sum_{i=1}^m D^{-l_i} \le 1 \).
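A minimal check of the inequality for a binary alphabet (D = 2); the code-word lengths below are made up:

```python
def kraft_sum(lengths, D=2):
    # sum_i D^(-l_i); must be <= 1 for a uniquely decodable code over D symbols
    return sum(D ** -l for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # e.g. {0, 10, 110, 111}: sum = 1.0, a valid prefix code exists
print(kraft_sum([1, 1, 2]))      # sum = 1.25 > 1: no uniquely decodable code with these lengths
```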
Long-range dependency
Correlation:
- \( c_d(w) = \frac{P_d(w, w)}{P(w)^2} \).
- where \( P_d(w, w) \) is the probability that two occurrences of \(w\) appear at distance \(d\).
- Range \( [0, \infty) \); 1 means independence.
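A minimal sketch of estimating \( c_d(w) \) from a token sequence (toy data; \( P_d(w, w) \) is estimated as the fraction of positions \(i\) with \(w\) at both \(i\) and \(i+d\)):

```python
def correlation(tokens, w, d):
    # P(w): unigram probability; P_d(w, w): probability that positions i and i+d both hold w
    p_w = tokens.count(w) / len(tokens)
    pairs = sum(1 for i in range(len(tokens) - d) if tokens[i] == w and tokens[i + d] == w)
    p_d = pairs / (len(tokens) - d)
    return p_d / (p_w ** 2)

tokens = "a b a b a c a b".split()      # hypothetical token sequence
print(correlation(tokens, "a", 2))      # > 1 indicates positive correlation at distance 2
```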
Conditional Entropy:
- \( h_n = - \sum_{w_1, w_2, \ldots, w_{n+1}} P(w_1 \ldots w_{n+1}) \log P(w_{n+1}|w_1 \ldots w_n) \).
- Range \( [0, H(W)] \); 0 means full dependence and \( H(W) \) means independence.