Machine Learning
kNN
Dimensionality is problematic
1NN
assign the label of the nearest neighbor
let \( \mathcal{N}_k(x) \) be the set of the k nearest neighbors of vector \( x \)
\( p(z=c \ | \ x, k) = \frac{1}{k} \sum_{i \in \mathcal{N}_k(x)} \mathbb{1}(z_i = c) \); the predicted label is the \( \arg\max_c \) of this expression
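A minimal numpy sketch of this rule (illustrative only; `X_train`, `z_train` and `k` are assumed inputs, not defined in these notes):

```python
import numpy as np

def knn_predict(x, X_train, z_train, k):
    """Predict the label of x as the majority class among its k nearest neighbors."""
    # Euclidean distances from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k nearest neighbors N_k(x)
    neighbours = np.argsort(dists)[:k]
    # p(z = c | x, k) = (1/k) * sum of indicators over the neighborhood
    classes, counts = np.unique(z_train[neighbours], return_counts=True)
    return classes[np.argmax(counts)]   # argmax_c p(z = c | x, k)
```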
Distance Metrics
Problem: kNN is sensitive to feature scale
\( \rightarrow \) data standardization \( x_{i,std} = \frac{x_i - \mu_i }{\sigma_i }\)
mahalanobis\( (x_1,x_2) = \sqrt{(x_1-x_2)^T\Sigma^{-1}(x_1-x_2)} \); if \( \Sigma \) is the (diagonal) covariance of the data, this is exactly like data standardization
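A small numpy sketch of both remedies (made-up data `X`; with a diagonal covariance the Mahalanobis distance equals the Euclidean distance between standardized points):

```python
import numpy as np

X = np.random.randn(100, 3) * np.array([1.0, 10.0, 0.1])  # features on very different scales

# standardization: x_std = (x - mu) / sigma, per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Mahalanobis distance with the diagonal covariance of the data
Sigma_inv = np.linalg.inv(np.diag(X.var(axis=0)))
def mahalanobis(x1, x2):
    d = x1 - x2
    return np.sqrt(d @ Sigma_inv @ d)

# equals the Euclidean distance between the standardized points
print(np.isclose(mahalanobis(X[0], X[1]),
                 np.linalg.norm(X_std[0] - X_std[1])))   # True
```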
Generative model
assumption: data points are generated by multivariate Gaussians
\( p(x \ | \ z=c) = \frac{1}{N_c} \sum_{x_n \in \text{class } c} \mathcal{N}(x \ | \ x_n, \sigma \mathbf I) \)
\( p(z=c \ | \ x^*) \propto p(x^* \ | \ z=c) \ p(z=c) \)
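A sketch of this class-conditional model (a Gaussian kernel density per class; the training arrays and `sigma` are placeholders, the kernel variance is written \( \sigma^2 \) here, and log-sum-exp is used for numerical stability):

```python
import numpy as np

def log_gaussian(x, mean, sigma):
    """log of an isotropic d-dimensional Gaussian centred at `mean` with scale sigma."""
    d = x.shape[0]
    return -0.5 * np.sum((x - mean) ** 2) / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

def predict_class(x, X_train, z_train, sigma=1.0):
    scores = {}
    for c in np.unique(z_train):
        Xc = X_train[z_train == c]
        # p(x | z=c) = (1/N_c) * sum_n N(x | x_n, .) computed in log space
        log_lik = np.logaddexp.reduce([log_gaussian(x, xn, sigma) for xn in Xc]) - np.log(len(Xc))
        log_prior = np.log(len(Xc) / len(X_train))   # p(z=c) from class frequencies
        scores[c] = log_lik + log_prior              # log p(z=c | x) up to a constant
    return max(scores, key=scores.get)
```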
Neighborhood Component Analysis (NCA)
also learns the metric
mahalanobis\( (x_1,x_2) = (Ax_1-Ax_2)^T (Ax_1 - Ax_2) \) (squared distance in the projected space)
stochastic neighbor probabilities (softmax over negative squared distances) \( p_{i,j} = \frac{\exp(- ||Ax_i-Ax_j||^2)}{\sum_{k \neq i}\exp(- ||Ax_i-Ax_k||^2)} \)
probability of correct classification \( p_i = \sum_{j \in C_i} p_{i,j} \quad C_i = \{ j \ | \ c_i = c_j \} \)
maximise \( f(A) = \sum_i p_i \)
if \( A \) is non-square \( \rightarrow \) dimensionality reduction (see the sketch below)
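A sketch of the NCA objective \( f(A) \) to be maximised, e.g. by gradient ascent (the optimisation loop is not shown; `A`, `X`, `y` are assumed inputs, with `A` of shape k x d):

```python
import numpy as np

def nca_objective(A, X, y):
    """f(A) = sum_i p_i, the expected number of correctly classified points."""
    Z = X @ A.T                                    # projected points A x_i
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)             # exclude self: p_ii = 0
    P = np.exp(-sq_dists)
    P /= P.sum(axis=1, keepdims=True)              # softmax over negative squared distances
    same_class = (y[:, None] == y[None, :])
    p_i = (P * same_class).sum(axis=1)             # p_i = sum_{j in C_i} p_ij
    return p_i.sum()

# a non-square A (e.g. 2 x d) also performs dimensionality reduction
```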
Probability Theory
Data generating Processes
Reasoning about outcomes
Random Variables \( X: \Omega \rightarrow \mathbb{R}\)
e.g. roll a die 10 times \( \rightarrow \Omega \) is the set of tuples of length 10; let \( X \) count the number of sixes, e.g. \( X(6,6,6,6,6,2,3,4,5,4) = 5 \)
Multivariate random variables
\(F_{XY}(x,y)= P(X \leq x,Y\leq y) \)
\( f_{XY}(x,y) = \frac{\partial^2 F_{XY}(x,y)}{\partial x \, \partial y} \)
Random vector
\( \mathbf{X}: \Omega \rightarrow \mathbb{R}^n\)
\(F_{X_1, X_2 ...X_n}(x_1,x_2,... x_n)= P(X_1 \leq x_1,X_2\leq x_2 ... X_n \leq x_n) \)
Covariance
\( \Sigma_{i,j}= Cov(X_i,X_j) \)
\( \mathbb{\Sigma}= E[(\mathbf{X}-E[\mathbf{X}])(\mathbf{X}-E[\mathbf{X}])^T] = E[\mathbf{X}\mathbf{X}^T] - E[\mathbf{X}] E[\mathbf{X}]^T\)
zero covariance does not imply independence, but independence implies zero covariance
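A quick numpy check of the identity \( \Sigma = E[\mathbf{X}\mathbf{X}^T] - E[\mathbf{X}]E[\mathbf{X}]^T \) on made-up correlated samples:

```python
import numpy as np

# rows are draws of a 3-dimensional random vector with correlated components
X = np.random.randn(10000, 3) @ np.array([[2.0, 0.0, 0.0],
                                           [1.0, 1.0, 0.0],
                                           [0.0, 0.5, 0.3]])

mu = X.mean(axis=0)
Sigma_def   = (X - mu).T @ (X - mu) / len(X)        # E[(X - E[X])(X - E[X])^T]
Sigma_ident = X.T @ X / len(X) - np.outer(mu, mu)   # E[X X^T] - E[X] E[X]^T
print(np.allclose(Sigma_def, Sigma_ident))          # True
```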
identically distributed
if they have the same CDF or PDF/PMF
Gaussian
\( \mathbf{X} \sim N(\mathbf{\mu}, \Sigma), \ \mu \in \mathbb{R}^n, \ \Sigma \in \mathbb{R}^{n \times n} \)
\( f_{\mathbf{X}}(x_1,x_2, \dots, x_n) = \frac{1}{\sqrt{(2 \pi)^n | \Sigma | }} e^{-\frac{1}{2} ( \mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x} - \mu) }\)
Independence: for all subsets \( I = \{i_1, \dots, i_k\} \subseteq \{1, \dots, n\} \)
\( f_{X_{i_1},X_{i_2}, \dots, X_{i_k}}(x_{i_1},x_{i_2}, \dots, x_{i_k}) = f_{X_{i_1}}(x_{i_1}) f_{X_{i_2}}(x_{i_2}) \cdots f_{X_{i_k}}(x_{i_k}) \)
Special Distributions
\( Gamma(x \ | \ a,b) = \frac{1}{\Gamma(a)} b^a x^{a-1}e^{-bx} \)
\( Beta(x\ | \ a,b) = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} x^{a-1} (1-x)^{b-1}\)
Gauss
\( X \sim N(\mu, \sigma^2) \)
\( \mathcal{N}(x \ | \ \mu,\sigma^2) = \frac{1}{\sqrt{2 \pi} \sigma } e^{-\frac{(x - \mu)^2}{2 \sigma^2}} \)
Uniform
\( U( x \ | \ a,b) = \frac{1}{b-a} \) for \( x \in [a,b] \), 0 otherwise
Binomial Distribution
\( X \sim Bin(N,\mu) \rightarrow Bin(x \ | \ N,\mu) = \binom{N}{x} \mu^x (1-\mu)^{N-x} \)
for large \( N \) and small \( \mu \): \( X \sim Bin(N, \mu) \approx Poi(\lambda), \quad \lambda = N \mu, \quad Poi(x \ | \ \lambda) = \frac{e^{- \lambda } \lambda^x}{x!} \)
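A small numerical check of the Poisson approximation for large \( N \) and small \( \mu \) (scipy assumed available; the numbers are made up):

```python
import numpy as np
from scipy import stats

N, mu = 1000, 0.003          # many trials, small success probability
lam = N * mu                 # lambda = N * mu = 3

x = np.arange(10)
binom_pmf = stats.binom.pmf(x, N, mu)
poisson_pmf = stats.poisson.pmf(x, lam)
print(np.max(np.abs(binom_pmf - poisson_pmf)))   # very small: the approximation is good
```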
Bernoulli
\( X \sim Ber(\mu) , p_X(x) \rightarrow Ber(x \ | \ \mu) = \mu^x (1-\mu)^{1-x}\)
Concepts
\( E[g(X)] = \int g(x) f_X(x) dx \)
\( Var(X) = E[(X-E[X])^2] \)
\( H[X] = - \sum_x P(X=x) \ \ln P(X=x) = -E[\ln P(X)] \)
\( KL(p_1 || \ p_2 ) = - \sum_x p_1(x) ln (\frac{p_2(x)}{p_1(x)}) \)
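A numerical example of entropy and KL divergence for two made-up discrete distributions:

```python
import numpy as np

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])

entropy = -np.sum(p1 * np.log(p1))     # H[X] = -sum_x p(x) ln p(x)
kl = -np.sum(p1 * np.log(p2 / p1))     # KL(p1 || p2) >= 0, zero iff p1 == p2
print(entropy, kl)
```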
Cumulative distribution function \( F_X : \mathbb{R} \rightarrow [0,1] \)
\( F_X(x) = P(X \leq x)\)
takes a realization of the random variable X as input
if \( F_X(x) = F_Y(x) \) for all \( x \), X and Y are identically distributed
continuous random variable
\( P(a\leq X \leq b) := P( \{\omega \in \Omega : a \leq X(\omega) \leq b \} ) \)
Probability density function (PDF)
\( f_X(x) = \frac{\partial F_X(x)}{\partial x}\)
Discrete Random variable
\( P(X=x) := P( \{\omega \in \Omega : X(\omega) = x\} ) \)
Probability mass function (PMF)
\( p_X(x) = P(X=x)\)
Concepts:
Conditional Probability
\( P(A|B) = \frac{P(A,B)}{P(B)} \)
Multiplication law
law of total probability \(\rightarrow \) Bayes
\( P(A_1,A_2 ...A_n) = P(A_n|A_{n-1}...A_1) ... P(A_3|A_2,A_1) P(A_2|A_1)P(A_1) \)
\( P(B) = \sum_{A \in \text{partition}} P(B,A) = \sum_{A \in \text{partition}} P(B|A) \, P(A) \)
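Worked example with made-up numbers: a test detects a disease with \( P(+ \ | \ D) = 0.9 \), false-positive rate \( P(+ \ | \ \neg D) = 0.1 \), and prior \( P(D) = 0.01 \). The law of total probability gives \( P(+) = 0.9 \cdot 0.01 + 0.1 \cdot 0.99 = 0.108 \), and Bayes then gives \( P(D \ | \ +) = \frac{0.9 \cdot 0.01}{0.108} \approx 0.083 \).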
Independence
P(A,B) = P(A)P(B)
Probability Space \( (\Omega, \mathcal{F}, P) \)
Probability measure \(P : \mathcal{F} \rightarrow [0,1]\)
For \( A_1, A_2 ... \ \text{disjoint} \ P( \cup_i A_i) = \sum_i P(A_i) \)
Set of events \( \mathcal{F}\) is a \( \sigma\)-field
consisting of subsets of \( \Omega \)
e.g. \( A \in \mathcal{F}, \quad A = \{2,4,6\} \)
Sample Space \( \Omega \) (outcomes)
e.g. \( \Omega = \{1,2,3,4,5,6\} \)
Statistical Inference
Given data, what can we say about the data-generating process?
Parameter Inference
The amount of data determines the approach
Coin Flip (2 Flips)
Make Parameter a random variable \(\theta \)
Want the maximum a posteriori estimate \( \theta_{MAP} = \arg\max_\theta \ p(\theta \ | \ D) \)
\( p(\theta= x \ | \ D )= \frac{p(D \ | \ \theta =x) p (\theta =x) }{p(D)}\)
choose the prior \( p(\theta=x) \) such that computations are easy (in this case the Beta distribution, the conjugate prior)
\( p(\theta = x \ | \ D) \propto x^{|T|+a-1}(1-x)^{|H|+b-1}\)
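A numerical sketch of this MAP estimate with assumed counts and prior (the posterior is a \( Beta(|T|+a, |H|+b) \), whose mode has a closed form):

```python
import numpy as np

a, b = 2, 2            # Beta prior pseudo-counts (assumed values)
T, H = 2, 0            # observed flips: 2 tails, 0 heads

# posterior p(theta | D) is proportional to theta^(T+a-1) * (1-theta)^(H+b-1)
theta = np.linspace(1e-6, 1 - 1e-6, 10001)
log_post = (T + a - 1) * np.log(theta) + (H + b - 1) * np.log(1 - theta)
theta_map = theta[np.argmax(log_post)]

# closed-form mode of a Beta(alpha, beta): (alpha - 1) / (alpha + beta - 2)
print(theta_map, (T + a - 1) / (T + a + H + b - 2))   # both approx. 0.75
```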
Coin Flip (10 Flips)
Maximise the likelihood of the observations (MLE)
\( \theta_{MLE} = \arg\max_\theta \ p(D \ | \ \theta) \)
\( p( D \ | \ \theta_1 , ... \theta_{10}) \)
independence \( \rightarrow = \prod_{i=1}^{10} p(F_i=f_i \ | \ \theta_i) \)
iid \(\rightarrow = \prod_{i=1}^{10} p(F_i=f_i \ | \ \theta) \)
maximise log likelihood instead
\( \theta_{MLE} = \frac{T}{H+T}\)
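The intermediate step: with \( |T| \) tails and \( |H| \) heads, the log-likelihood is \( \ln p(D \ | \ \theta) = |T| \ln\theta + |H| \ln(1-\theta) \); setting its derivative \( \frac{|T|}{\theta} - \frac{|H|}{1-\theta} \) to zero gives \( \theta_{MLE} = \frac{|T|}{|H|+|T|} \).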
Fully Bayesian analysis
\( P(F=f \ | \ D, a, b) \): integrate out \( \theta \) instead of maximising it
\( p(f \ | \ D,a,b) = \int p(f, \theta \ | \ D,a,b) \, d\theta = \int p(f \ | \ \theta) \ p(\theta \ | \ D,a,b) \, d\theta \); for \( f = T \) this is the posterior mean \( \frac{|T|+a}{|T|+|H|+a+b} \)
Linear Regression
input \( X = (x_1, x_2, \dots, x_n) \)
targets \( z = (z_1, z_2, \dots, z_n) \)
\( z_i = y(x_i) + \epsilon \)
\( E_D(W) = \frac{1}{2} \sum_n (y(x_n,W)-z_n)^2 = \frac{1}{2}(\Phi W -z)^T(\Phi W - z) \), where \( \Phi \) is the design matrix with entries \( \Phi_{nj} = \phi_j(x_n) \)
\( y(x,W) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x) = W^{T} \phi(x) \) (with \( \phi_0(x) = 1 \))
basis functions:
\( \phi_j(x) = x_j \)
\( \phi_j(x) = e^{-(x-\mu_j)^2/2\sigma^2} \)
\( \nabla_W E_D(W) = 0 \rightarrow W_{opt}= (\Phi^T \Phi)^{-1}\Phi^T z \)
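A minimal numpy fit using this closed form (the toy sine data and Gaussian basis centres are assumptions; a linear solve replaces the explicit inverse):

```python
import numpy as np

# toy data (assumed): noisy sine targets
x = np.linspace(0, 1, 50)
z = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(50)

# design matrix Phi with Phi[n, j] = phi_j(x_n): a bias column plus Gaussian basis functions
centres, s = np.linspace(0, 1, 9), 0.1
Phi = np.column_stack([np.ones_like(x)] +
                      [np.exp(-(x - m) ** 2 / (2 * s ** 2)) for m in centres])

# W_opt = (Phi^T Phi)^{-1} Phi^T z, computed as a linear solve
W_opt = np.linalg.solve(Phi.T @ Phi, Phi.T @ z)
print(W_opt.shape)   # (10,) = bias weight + 9 basis weights
```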
Problem: overfitting \( \rightarrow \) ridge regularization
\( E_D(W) = \frac{1}{2} \sum_n (W^T\phi(x_n)-z_n)^2 + \frac{\lambda}{2} ||W||_2^2 \)
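Setting the gradient of this regularised error to zero gives the standard closed form \( W_{ridge} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T z \) (not written out in the notes above).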
Bayesian Regression: model the noise as Gaussian, MLE estimation
\( p(z \ | \ x,W, \beta) = \mathcal{N}(z \ | \ y(x,W), \beta^{-1})\)
\( p(Z \ | \ X,W,\beta)= \prod_{n=1}^N \mathcal N (z_n \ | \ y(x_n, W), \beta^{-1}) \)
maximising the log-likelihood with respect to \( W \) is equivalent to minimising \( \frac{1}{2} \sum_n (y(x_n,W)-z_n)^2 \), the same quadratic error function
predict using \( P(z \ | \ x, W_{ML}, \beta_{ML}) \); problem: overfitting
solution: MAP with a conjugate (Gaussian) prior \( \rightarrow \) posterior \( P(W \ | \ Z) = \mathcal N (W \ | \ M_N, S_N) \); this is equivalent to the quadratic error plus weight regularization (ridge)
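Assuming a zero-mean isotropic prior \( p(W) = \mathcal N(W \ | \ 0, \alpha^{-1} I) \) (the prior precision \( \alpha \) is not named in these notes), the standard posterior parameters are \( S_N^{-1} = \alpha I + \beta \Phi^T \Phi \) and \( M_N = \beta S_N \Phi^T z \); the ridge penalty corresponds to \( \lambda = \alpha / \beta \).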
Kernel
Radial Basis function
Problem dimensionality!
Let \( X = \{x_n \ | \ n = 1,\dots,N\} \)
Decompose \( w = \tilde w + z \) with \( \tilde w \in \mathrm{Span}(X) \), \( z \perp \mathrm{Span}(X) \); on the data, \( y = w^T x = (\tilde w+z)^T x = \tilde w^T x \), so w.l.o.g. \( w = \sum_n a_n x_n \)
or more general \( w = \sum_n a_n \phi(x_n) \)
\( \rightarrow y(x) = \sum_n a_n \phi(x_n)^T \phi(x) = \sum_n a_n K(x_n,x), \quad K(x,y) = \phi(x)^T \phi(y)\)
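A sketch of this kernelised predictor with an RBF kernel; the coefficients \( a_n \) are fit here with a kernel ridge solve, an assumed choice since these notes do not specify how they are learned:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=10.0):
    """K(x, y) = exp(-gamma * ||x - y||^2), a radial basis function kernel."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-gamma * sq)

# toy 1-D data (assumed)
X = np.linspace(0, 1, 30)[:, None]
z = np.sin(2 * np.pi * X[:, 0]) + 0.1 * np.random.randn(30)

# fit coefficients a_n via a regularised kernel solve: a = (K + lam I)^{-1} z
lam = 1e-3
K = rbf_kernel(X, X)
a = np.linalg.solve(K + lam * np.eye(len(X)), z)

# y(x) = sum_n a_n K(x_n, x): only kernel evaluations, never phi(x) explicitly
X_test = np.linspace(0, 1, 5)[:, None]
y_pred = rbf_kernel(X_test, X) @ a
print(y_pred)
```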
TODO
Mercer's Theorem
Overcomplete