Machine Learning
kNN
Dimensionality is problematic
1NN
assign the label of the nearest neighbor
let \( \mathcal{N}_k(x) \) be the set of the k nearest neighbors of vector \( x \)
\( p(z=c \ | \ x, k) = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} \mathbb{1}(z_i = c) \); the predicted label is the \( \arg\max_c \) of this expression
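A minimal numpy sketch of this rule (illustrative only; `X_train`, `z_train` and `k` are assumed inputs, not defined in these notes):

```python
import numpy as np

def knn_predict(x, X_train, z_train, k):
    """Predict the label of x as the majority class among its k nearest neighbors."""
    # Euclidean distances from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k nearest neighbors N_k(x)
    neighbours = np.argsort(dists)[:k]
    # p(z = c | x, k) = (1/k) * sum of indicators over the neighborhood
    classes, counts = np.unique(z_train[neighbours], return_counts=True)
    return classes[np.argmax(counts)]   # argmax_c p(z = c | x, k)
```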
Distance Metrics
Problem: kNN is sensitive to feature scale
\( \rightarrow \) data standardization \( x_{i,std} = \frac{x_i - \mu_i }{\sigma_i }\)
mahalanobis\( (x_1,x_2) = \sqrt{(x_1-x_2)^T\Sigma^{-1}(x_1-x_2)} \); if \( \Sigma \) is the (diagonal) covariance of the data, this is exactly like data standardization
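A small numpy sketch of both remedies (made-up data `X`; with a diagonal covariance the Mahalanobis distance equals the Euclidean distance between standardized points):

```python
import numpy as np

X = np.random.randn(100, 3) * np.array([1.0, 10.0, 0.1])  # features on very different scales

# standardization: x_std = (x - mu) / sigma, per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Mahalanobis distance with the diagonal covariance of the data
Sigma_inv = np.linalg.inv(np.diag(X.var(axis=0)))
def mahalanobis(x1, x2):
    d = x1 - x2
    return np.sqrt(d @ Sigma_inv @ d)

# equals the Euclidean distance between the standardized points
print(np.isclose(mahalanobis(X[0], X[1]),
                 np.linalg.norm(X_std[0] - X_std[1])))   # True
```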
Generative model
assumption: data points are generated by multivariate Gaussians
\( p(x \ | \ z=c) = \frac{1}{N_c} \sum_{x_n \in \text{class } c} \mathcal{N}(x \ | \ x_n, \sigma \mathbf I) \)
\( p(z=c \ | \ x^*) \propto p(x^* \ | \ z=c) \ p(z=c) \)
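A sketch of this class-conditional model (a Gaussian kernel density per class; the training arrays and `sigma` are placeholders, the kernel variance is written \( \sigma^2 \) here, and log-sum-exp is used for numerical stability):

```python
import numpy as np

def log_gaussian(x, mean, sigma):
    """log of an isotropic d-dimensional Gaussian centred at `mean` with scale sigma."""
    d = x.shape[0]
    return -0.5 * np.sum((x - mean) ** 2) / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

def predict_class(x, X_train, z_train, sigma=1.0):
    scores = {}
    for c in np.unique(z_train):
        Xc = X_train[z_train == c]
        # p(x | z=c) = (1/N_c) * sum_n N(x | x_n, .) computed in log space
        log_lik = np.logaddexp.reduce([log_gaussian(x, xn, sigma) for xn in Xc]) - np.log(len(Xc))
        log_prior = np.log(len(Xc) / len(X_train))   # p(z=c) from class frequencies
        scores[c] = log_lik + log_prior              # log p(z=c | x) up to a constant
    return max(scores, key=scores.get)
```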
Neighborhood Component Analysis (NCA)
also learns the metric
mahalanobis\( (x_1,x_2) = (Ax_1-Ax_2)^T (Ax_1 - Ax_2) \) (squared distance in the projected space)
stochastic neighbor probabilities (softmax over negative squared distances) \( p_{i,j} = \frac{\exp(- ||Ax_i-Ax_j||^2)}{\sum_{k \neq i}\exp(- ||Ax_i-Ax_k||^2)} \)
probability of correct classification \( p_i = \sum_{j \in C_i} p_{i,j} \quad C_i = \{ j \ | \ c_i = c_j \} \)
maximise \( f(A) = \sum_i p_i \)
if \( A \) is non-square \( \rightarrow \) dimensionality reduction (see the sketch below)
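A sketch of the NCA objective \( f(A) \) to be maximised, e.g. by gradient ascent (the optimisation loop is not shown; `A`, `X`, `y` are assumed inputs, with `A` of shape k x d):

```python
import numpy as np

def nca_objective(A, X, y):
    """f(A) = sum_i p_i, the expected number of correctly classified points."""
    Z = X @ A.T                                    # projected points A x_i
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)             # exclude self: p_ii = 0
    P = np.exp(-sq_dists)
    P /= P.sum(axis=1, keepdims=True)              # softmax over negative squared distances
    same_class = (y[:, None] == y[None, :])
    p_i = (P * same_class).sum(axis=1)             # p_i = sum_{j in C_i} p_ij
    return p_i.sum()

# a non-square A (e.g. 2 x d) also performs dimensionality reduction
```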
Probability Theory
Data generating Processes
Reasoning about outcomes
Random Variables \( X: \Omega \rightarrow \mathbb{R}\)
e.g. roll a die 10 times \( \rightarrow \Omega \) is the set of tuples of length 10; let \( X \) count the number of sixes, e.g. \( X(6,6,6,6,6,2,3,4,5,4) = 5 \)
Multivariate random variables
\(F_{XY}(x,y)= P(X \leq x,Y\leq y) \)
\( f_{XY}(x,y) = \frac{\partial^2 F_{XY}(x,y)}{\partial x \, \partial y} \)
Random vector
\( \mathbf{X}: \Omega \rightarrow \mathbb{R}^n\)
\(F_{X_1, X_2 ...X_n}(x_1,x_2,... x_n)= P(X_1 \leq x_1,X_2\leq x_2 ... X_n \leq x_n) \)
Covariance
\( \Sigma_{i,j}= Cov(X_i,X_j) \)
\( \mathbb{\Sigma}= E[(\mathbf{X}-E[\mathbf{X}])(\mathbf{X}-E[\mathbf{X}])^T] = E[\mathbf{X}\mathbf{X}^T] - E[\mathbf{X}] E[\mathbf{X}]^T\)
zero covariance does not imply independence, but independence implies zero covariance
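A quick numpy check of the identity \( \Sigma = E[\mathbf{X}\mathbf{X}^T] - E[\mathbf{X}]E[\mathbf{X}]^T \) on made-up correlated samples:

```python
import numpy as np

# rows are draws of a 3-dimensional random vector with correlated components
X = np.random.randn(10000, 3) @ np.array([[2.0, 0.0, 0.0],
                                           [1.0, 1.0, 0.0],
                                           [0.0, 0.5, 0.3]])

mu = X.mean(axis=0)
Sigma_def   = (X - mu).T @ (X - mu) / len(X)        # E[(X - E[X])(X - E[X])^T]
Sigma_ident = X.T @ X / len(X) - np.outer(mu, mu)   # E[X X^T] - E[X] E[X]^T
print(np.allclose(Sigma_def, Sigma_ident))          # True
```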
identically distributed
if they have the same CDF or PDF/PMF
Gaussian
\( \mathbf{X} \sim N(\mathbf{\mu}, \Sigma), \ \mu \in \mathbb{R}^n, \ \Sigma \in \mathbb{R}^{n \times n} \)
\( f_{\mathbf{X}}(x_1,x_2, \dots, x_n) = \frac{1}{\sqrt{(2 \pi)^n | \Sigma | }} e^{-\frac{1}{2} ( \mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x} - \mu) }\)
Independence: for all subsets \( I = \{i_1, \dots, i_k\} \subseteq \{1, \dots, n\} \)
\( f_{X_{i_1},X_{i_2}, \dots, X_{i_k}}(x_{i_1},x_{i_2}, \dots, x_{i_k}) = f_{X_{i_1}}(x_{i_1}) f_{X_{i_2}}(x_{i_2}) \cdots f_{X_{i_k}}(x_{i_k}) \)
Special Distributions
\( Gamma(x \ | \ a,b) = \frac{1}{\Gamma(a)} b^a x^{a-1}e^{-bx} \)
\( Beta(x\ | \ a,b) = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} x^{a-1} (1-x)^{b-1}\)
Gauss
\( X \sim N(\mu, \sigma^2) \)
\( \mathcal{N}(x \ | \ \mu,\sigma^2) = \frac{1}{\sqrt{2 \pi} \sigma } e^{-\frac{(x - \mu)^2}{2 \sigma^2}} \)
Uniform
\( U( x \ | \ a,b) = \frac{1}{b-a} \) for \( x \in [a,b] \), 0 otherwise
Binomial Distribution
\( X \sim Bin(N,\mu) \rightarrow Bin(x \ | \ N,\mu) = \binom{N}{x} \mu^x (1-\mu)^{N-x} \)
for large \( N \) and small \( \mu \): \( X \sim Bin(N, \mu) \approx Poi(\lambda), \quad \lambda = N \mu, \quad Poi(x \ | \ \lambda) = \frac{e^{- \lambda } \lambda^x}{x!} \)
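A small numerical check of the Poisson approximation for large \( N \) and small \( \mu \) (scipy assumed available; the numbers are made up):

```python
import numpy as np
from scipy import stats

N, mu = 1000, 0.003          # many trials, small success probability
lam = N * mu                 # lambda = N * mu = 3

x = np.arange(10)
binom_pmf = stats.binom.pmf(x, N, mu)
poisson_pmf = stats.poisson.pmf(x, lam)
print(np.max(np.abs(binom_pmf - poisson_pmf)))   # very small: the approximation is good
```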
Bernoulli
\( X \sim Ber(\mu) , p_X(x) \rightarrow Ber(x \ | \ \mu) = \mu^x (1-\mu)^{1-x}\)
Concepts
\( E[g(X)] = \int g(x) f_X(x) dx \)
\( Var(X) = E[(X-E[X])^2] \)
\( H[X] = - \sum_x P(X=x) \ \ln P(X=x) = -E[\ln P(X)] \)
\( KL(p_1 || \ p_2 ) = - \sum_x p_1(x) ln (\frac{p_2(x)}{p_1(x)}) \)
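A numerical example of entropy and KL divergence for two made-up discrete distributions:

```python
import numpy as np

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(p1 * np.log(p1))     # H[X] = -sum_x p(x) ln p(x)
kl = -np.sum(p1 * np.log(p2 / p1))     # KL(p1 || p2) >= 0, zero iff p1 == p2
print(entropy, kl)
```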
Cumulative distribution function \( F_X : \mathbb{R} \rightarrow [0,1] \)
\( F_X(x) = P(X \leq x)\)
takes a realization of the random variable X as input
if \( F_X(x) = F_Y(x) \) for all \( x \), X and Y are identically distributed
continuous random variable
\( P(a\leq X \leq b) := P( \{\omega \in \Omega : a \leq X(\omega) \leq b \} ) \)
Probability density function (PDF)
\( f_X(x) = \frac{\partial F_X(x)}{\partial x}\)
Discrete Random variable
\( P(X=x) := P( \{\omega \in \Omega : X(\omega) = x\} ) \)
Probability mass function (PMF)
\( p_X(x) = P(X=x)\)
Concepts:
Conditional Probability
\( P(A|B) = \frac{P(A,B)}{P(B)} \)
Multiplication law
law of total probability \(\rightarrow \) Bayes
\( P(A_1,A_2 ...A_n) = P(A_n|A_{n-1}...A_1) ... P(A_3|A_2,A_1) P(A_2|A_1)P(A_1) \)
\( P(B) = \sum_{A \in \text{partition}} P(B,A) = \sum_{A \in \text{partition}} P(B|A) \, P(A) \)
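Worked example with made-up numbers: a test detects a disease with \( P(+ \ | \ D) = 0.9 \), false-positive rate \( P(+ \ | \ \neg D) = 0.1 \), and prior \( P(D) = 0.01 \). The law of total probability gives \( P(+) = 0.9 \cdot 0.01 + 0.1 \cdot 0.99 = 0.108 \), and Bayes then gives \( P(D \ | \ +) = \frac{0.9 \cdot 0.01}{0.108} \approx 0.083 \).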
Independence
P(A,B) = P(A)P(B)
Probability Space \( (\Omega, \mathcal{F}, P) \)
Probability measure \(P : \mathcal{F} \rightarrow [0,1]\)
For \( A_1, A_2 ... \ \text{disjoint} \ P( \cup_i A_i) = \sum_i P(A_i) \)
Set of events \( \mathcal{F}\) is a \( \sigma\)-field
consisting of subsets of \( \Omega \)
e.g. \( A \in \mathcal{F}, \quad A = \{2,4,6\} \)
Sample Space \( \Omega \) (outcomes)
e.g. \( \Omega = \{1,2,3,4,5,6\} \)
Statistical Inference
Given data, what can we say about the data-generating process?
Parameter Inference
The amount of data determines the approach
Coin Flip (2 Flips)
Make Parameter a random variable \(\theta \)
Want the maximum a posteriori estimate \( \theta_{MAP} = \arg\max_\theta \ p(\theta \ | \ D) \)
\( p(\theta= x \ | \ D )= \frac{p(D \ | \ \theta =x) p (\theta =x) }{p(D)}\)
choose the prior \( p(\theta=x) \) such that computations are easy (in this case the Beta distribution, the conjugate prior)
\( p(\theta = x \ | \ D) \propto x^{|T|+a-1}(1-x)^{|H|+b-1}\)
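A numerical sketch of this MAP estimate with assumed counts and prior (the posterior is a \( Beta(|T|+a, |H|+b) \), whose mode has a closed form):

```python
import numpy as np

a, b = 2, 2            # Beta prior pseudo-counts (assumed values)
T, H = 2, 0            # observed flips: 2 tails, 0 heads

# posterior p(theta | D) is proportional to theta^(T+a-1) * (1-theta)^(H+b-1)
theta = np.linspace(1e-6, 1 - 1e-6, 10001)
log_post = (T + a - 1) * np.log(theta) + (H + b - 1) * np.log(1 - theta)
theta_map = theta[np.argmax(log_post)]

# closed-form mode of a Beta(alpha, beta): (alpha - 1) / (alpha + beta - 2)
print(theta_map, (T + a - 1) / (T + a + H + b - 2))   # both approx. 0.75
```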
Coin Flip (10 Flips)
Maximise the likelihood of the observations (MLE)
\( \theta_{MLE} = \arg\max_\theta \ p(D \ | \ \theta) \)
\( p( D \ | \ \theta_1 , ... \theta_{10}) \)
independence \( \rightarrow = \prod_{i=1}^{10} p(F_i=f_i \ | \ \theta_i) \)
iid \(\rightarrow = \prod_{i=1}^{10} p(F_i=f_i \ | \ \theta) \)
maximise log likelihood instead
\( \theta_{MLE} = \frac{T}{H+T}\)
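The intermediate step: with \( |T| \) tails and \( |H| \) heads, the log-likelihood is \( \ln p(D \ | \ \theta) = |T| \ln\theta + |H| \ln(1-\theta) \); setting its derivative \( \frac{|T|}{\theta} - \frac{|H|}{1-\theta} \) to zero gives \( \theta_{MLE} = \frac{|T|}{|H|+|T|} \).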
Fully Bayesian analysis
\( P(F=f \ | \ D, a, b) \): integrate out \( \theta \) instead of maximising it
\( p(f \ | \ D,a,b) = \int p(f, \theta \ | \ D,a,b) \, d\theta = \int p(f \ | \ \theta) \ p(\theta \ | \ D,a,b) \, d\theta \); for \( f = T \) this is the posterior mean \( \frac{|T|+a}{|T|+|H|+a+b} \)
Linear Regression
input \( X = (x_1, x_2, \dots, x_n) \)
targets \( z = (z_1, z_2, \dots, z_n) \)
\( z_i = y(x_i) + \epsilon \)
\( E_D(W) = \frac{1}{2} \sum_n (y(x_n,W)-z_n)^2 = \frac{1}{2}(\Phi W -z)^T(\Phi W - z) \), where \( \Phi \) is the design matrix with entries \( \Phi_{nj} = \phi_j(x_n) \)
\( y(x,W) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x) = W^{T} \phi(x) \) (with \( \phi_0(x) = 1 \))
basis functions:
\( \phi_j(x) = x_j \)
\( \phi_j(x) = e^{-(x-\mu_j)^2/2\sigma^2} \)
\( \nabla_W E_D(W) = 0 \rightarrow W_{opt}= (\Phi^T \Phi)^{-1}\Phi^T z \)
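A minimal numpy fit using this closed form (the toy sine data and Gaussian basis centres are assumptions; a linear solve replaces the explicit inverse):

```python
import numpy as np

# toy data (assumed): noisy sine targets
x = np.linspace(0, 1, 50)
z = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(50)

# design matrix Phi with Phi[n, j] = phi_j(x_n): a bias column plus Gaussian basis functions
centres, s = np.linspace(0, 1, 9), 0.1
Phi = np.column_stack([np.ones_like(x)] +
                      [np.exp(-(x - m) ** 2 / (2 * s ** 2)) for m in centres])

# W_opt = (Phi^T Phi)^{-1} Phi^T z, computed as a linear solve
W_opt = np.linalg.solve(Phi.T @ Phi, Phi.T @ z)
print(W_opt.shape)   # (10,) = bias weight + 9 basis weights
```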
Problem: overfitting \( \rightarrow \) ridge regularization
\( E_D(W) = \frac{1}{2} \sum_n (W^T\phi(x_n)-z_n)^2 + \frac{\lambda}{2} ||W||_2^2 \)
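Setting the gradient of this regularised error to zero gives the standard closed form \( W_{ridge} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T z \) (not written out in the notes above).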
Bayesian Regression: model the noise as Gaussian, MLE estimation
\( p(z \ | \ x,W, \beta) = \mathcal{N}(z \ | \ y(x,W), \beta^{-1})\)
\( p(Z \ | \ X,W,\beta)= \prod_{n=1}^N \mathcal N (z_n \ | \ y(x_n, W), \beta^{-1}) \)
maximising the log-likelihood with respect to \( W \) is equivalent to minimising \( \frac{1}{2} \sum_n (y(x_n,W)-z_n)^2 \), the same quadratic error function
predict using \( P(z \ | \ x, W_{ML}, \beta_{ML}) \); problem: overfitting
solution: MAP with a conjugate (Gaussian) prior \( \rightarrow \) posterior \( P(W \ | \ Z) = \mathcal N (W \ | \ M_N, S_N) \); this is equivalent to the quadratic error plus weight regularization (ridge)
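Assuming a zero-mean isotropic prior \( p(W) = \mathcal N(W \ | \ 0, \alpha^{-1} I) \) (the prior precision \( \alpha \) is not named in these notes), the standard posterior parameters are \( S_N^{-1} = \alpha I + \beta \Phi^T \Phi \) and \( M_N = \beta S_N \Phi^T z \); the ridge penalty corresponds to \( \lambda = \alpha / \beta \).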
Kernel
Radial Basis function
Problem dimensionality!
Let \( X = \{x_n \ | \ n = 1,\dots,N\} \)
Decompose \( w = \tilde w + z \) with \( \tilde w \in \mathrm{Span}(X) \), \( z \perp \mathrm{Span}(X) \); on the data, \( y = w^T x = (\tilde w+z)^T x = \tilde w^T x \), so w.l.o.g. \( w = \sum_n a_n x_n \)
or more general \( w = \sum_n a_n \phi(x_n) \)
\( \rightarrow y(x) = \sum_n a_n \phi(x_n)^T \phi(x) = \sum_n a_n K(x_n,x), \quad K(x,y) = \phi(x)^T \phi(y)\)
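A sketch of this kernelised predictor with an RBF kernel; the coefficients \( a_n \) are fit here with a kernel ridge solve, an assumed choice since these notes do not specify how they are learned:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=10.0):
    """K(x, y) = exp(-gamma * ||x - y||^2), a radial basis function kernel."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * sq)

# toy 1-D data (assumed)
X = np.linspace(0, 1, 30)[:, None]
z = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(30)

# fit coefficients a_n via a regularised kernel solve: a = (K + lam I)^{-1} z
lam = 1e-3
K = rbf_kernel(X, X)
a = np.linalg.solve(K + lam * np.eye(len(X)), z)

# y(x) = sum_n a_n K(x_n, x): only kernel evaluations, never phi(x) explicitly
X_test = np.linspace(0, 1, 5)[:, None]
y_pred = rbf_kernel(X_test, X) @ a
print(y_pred)
```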
TODO
Mercer's Theorem
Overcomplete