COMP41021 Representation Learning
Introduction
Data are vectors in High Dimensions
High-quality images are made up of millions of pixels, each of which is a set of real numbers.
Models such as GPTs do predictive analysis over text by splitting sentences into character clusters/tokens, each of which is represented/embedded as a vector of real numbers.
Much of modern ML needs to take a real-valued function f as input; the function is sent to the model in an encoded form, such as a vector.
Often the coordinates of these data are not entirely independent and one can infer relationships among them, leading to low dimensional representations of the data.
Let Mathematics Guide Your High Dimensional Intuition
Diagrams can skew what we believe to be true, so don't reason about machine learning from high-dimensional diagrams; they lead to the wrong intuition.
8 Foundational Concepts
(8) Random Variables
Data is uncertain
As real-world data is high dimensional, it brings geometric complexities, and ML algorithms need to work with uncertainty about what exactly the data will be. We model this data using the language of "random variables", and we need formal ways of encoding them.
We focus on random variables which take values that can be written as vectors in n-real coordinates, and assume that they are describable by a pdf.
Most common examples of pdfs
Uniform distribution, p(x) = 1/4 for all x in [-2, 2] and 0 elsewhere, making these values of x equally likely.
Gaussian distribution, p(x, y), allows data to be any vector in the domain R^2, but vectors become exponentially less likely the farther they are from the origin.
(7) Risk Function
Assuming we have the basic notation to define supervised learning, we can also assume a choice of loss function which measures the error of the predictor, e.g. ℓ(w, x, y) = ½(y − ⟨w, x⟩)², the squared error.
We also assume that there exists a distribution D - with distribution function p - on X × Y from which the data (x, y) is being sampled. That leads to the notion of "expected loss" or "population risk", R(w) = E_{(x,y)~D}[ℓ(w, x, y)].
As we don't know the data distribution p and we only have samples from it, we have to work with the "empirical risk", R̂(w) = (1/m) Σ_{i=1}^{m} ℓ(w, x_i, y_i) over the m available samples.
The challenge of machine learning is to be able to minimise the population risk while only having access to empirical risk.
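As a concrete illustration (not from the notes), here is a minimal numpy sketch of the empirical risk of a linear predictor under the squared loss; the synthetic dataset and its noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = <w_true, x> + noise  (hypothetical generative model).
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(1000, 3))           # m = 1000 samples in R^3
y = X @ w_true + 0.1 * rng.normal(size=1000)

def empirical_risk(w, X, y):
    """Empirical risk R_hat(w) = (1/m) * sum_i 0.5 * (y_i - <w, x_i>)^2."""
    residuals = y - X @ w
    return 0.5 * np.mean(residuals ** 2)

print(empirical_risk(w_true, X, y))       # close to 0.5 * noise variance
print(empirical_risk(np.zeros(3), X, y))  # much larger for a bad predictor
```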
(6) Introduction to the Euclidean Norm
Definition - 2-Norm
Given a vector v ∈ R^p we define its 2-norm as ||v||_2 = sqrt(Σ_{i=1}^{p} v_i²), i.e. the Euclidean distance of the point v from the origin.
(5) Introduction to Convexity
One of the common classes of functions that we shall encounter as examples of "loss" functions are those which are differentiable and "convex".
Definition - Convexity
An (at least once) differentiable function F: R^p -> R is convex if ∀ x, y we have F(y) ≥ F(x) + ⟨∇F(x), y − x⟩, where ∇F is the derivative of F, which is assumed to be well-defined.
Note that for differentiable functions, convexity can be thought of as the property that for any point x in the domain, the tangent to the function at that point is below the graph of the function
(4) Basics of Constrained Convex Optimisation
The key challenge of ML is to solve function minimisation questions under partial information, so we think about possible algorithms of function minimisation.
For now we will restrict ourselves to a preliminary result about minimising a differentiable convex function under full-rank linear constraints.
Most Basic Question of Convex Optimisation
For f: R^n -> R, a differentiable convex function, find min f(x) subject to Ax = b, where A ∈ R^(p×n) has rank p ≤ n (rows are linearly independent), so that this minimum exists.
x* ∈ R^n solves the above question iff there exists a vector λ* ∈ R^p such that ∇f(x*) + A^T λ* = 0 and Ax* = b.
(You are only required to know how to use this result, not prove it)
Note the following interpretation: one could have defined the convex function L(x, λ) = f(x) + λ^T(Ax − b), called the Lagrangian, where the λ vector would be called the "Lagrange multipliers". Then it is easy to see that the previously stated necessary and sufficient conditions for the optimal point are equivalent to solving for x and λ such that ∇_x L(x, λ) = 0 and ∇_λ L(x, λ) = 0.
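As an illustrative sketch (not from the notes), for the quadratic f(x) = ½||x||² the conditions ∇f(x*) + A^T λ* = 0 and Ax* = b form a linear system that can be solved directly; the particular A and b below are made-up examples:

```python
import numpy as np

# Minimise f(x) = 0.5 * ||x||^2 subject to A x = b, so grad f(x) = x.
# Optimality conditions:  x* + A^T lam* = 0  and  A x* = b.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])    # full row rank: p = 2, n = 3 (made up)
b = np.array([1.0, 2.0])

n, p = A.shape[1], A.shape[0]
# Assemble the block system  [[I, A^T], [A, 0]] [x; lam] = [0; b].
KKT = np.block([[np.eye(n), A.T],
                [A, np.zeros((p, p))]])
rhs = np.concatenate([np.zeros(n), b])
sol = np.linalg.solve(KKT, rhs)
x_star, lam_star = sol[:n], sol[n:]

print(x_star)                      # the constrained minimiser
print(A @ x_star - b)              # ~0: constraint satisfied
print(x_star + A.T @ lam_star)     # ~0: stationarity of the Lagrangian
```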
(3) Examples of Non-Convex Neural Losses
Restricting to convex functions would not suffice to capture the part of the course which uses neural nets, so here is an example of how easily neural loss functions can become non-convex.
Consider training data (x1, y1) = (0.5, −100), (x2, y2) = (−1, 300), (x3, y3) = (1, 1), (x4, y4) = (−0.5, −400). Recall that one of the simplest neural nets consists of a single sigmoid gate of weight w, which maps x ↦ σ(wx) = 1/(1 + e^(−wx)).
A natural instance of a regularised squared loss function on the above gate for this training data would be of the form F(w) = λw² + (1/4) Σ_{i=1}^{4} (y_i − σ(w x_i))².
Notice how we think of losses as being univariate functions of the weight(s)
The plot of this F, for λ = 0.13, as a function of w shows that it isn't convex: it has a local maximum, and two local minima, only one of which is a global minimum.
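A small numpy sketch that scans this loss on a grid of w values and reports its local minima; the exact normalisation of F is an assumption, but the qualitative picture (more than one local minimum) matches the plot described above:

```python
import numpy as np

# Grid-scan of the (assumed) regularised loss
# F(w) = lam * w^2 + 0.25 * sum_i (y_i - sigmoid(w * x_i))^2.
x = np.array([0.5, -1.0, 1.0, -0.5])
y = np.array([-100.0, 300.0, 1.0, -400.0])
lam = 0.13

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def F(w):
    return lam * w**2 + 0.25 * np.sum((y - sigmoid(w * x))**2)

ws = np.linspace(-40.0, 40.0, 4001)
vals = np.array([F(w) for w in ws])
# Interior grid points lower than both neighbours are (approximate) local minima.
local_min = [(round(ws[i], 2), round(vals[i], 1)) for i in range(1, len(ws) - 1)
             if vals[i] < vals[i - 1] and vals[i] < vals[i + 1]]
print(local_min)   # more than one local minimum => F is not convex
```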
(2) Introduction to Spectral Norms of Matrices
Definition - 2,2-Norm
For any matrix A ∈ R^(m×n) we define its 2,2-norm as ||A||_{2,2} = sup_{x ≠ 0} ||Ax||_2 / ||x||_2 (sup is supremum, here a max). When it is clear from the context, the above will also just be called the 2-norm of A, denoted ||A||_2, or the spectral norm of A.
Definition - Eigenvector
A vector v ∈ R^n, v ≠ 0, is an eigenvector of a square matrix A ∈ R^(n×n) with eigenvalue λ if Av = λv.
An example of this considers the family of matrices A(θ) = [[0, θ], [0, 0]]. One can solve the equation Av = λv to obtain the eigenvalues and eigenvectors for this matrix and realise that its only eigenvalue is 0.
For an arbitrary vector on the unit circle in R^2 - parameterised as, say, (cos t, sin t)^T - we have A(θ)(cos t, sin t)^T = (θ sin t, 0)^T. Thus ||A(θ)||_2 = sup_t |θ sin t| = |θ|. So by choosing θ arbitrarily large in magnitude, we can make the spectral norm of A(θ) as large as we want, while the largest eigenvalue magnitude (the "spectral radius" of the matrix) remains 0 for all θ.
In general, we can show spectral radius ≤ spectral norm, but the gap between them can be arbitrarily large.
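A quick numpy check of this gap, using the family A(θ) = [[0, θ], [0, 0]] from the example above:

```python
import numpy as np

def spectral_norm(A):
    # 2-norm of a matrix = its largest singular value.
    return np.linalg.norm(A, 2)

def spectral_radius(A):
    # Largest eigenvalue magnitude.
    return np.max(np.abs(np.linalg.eigvals(A)))

for theta in [1.0, 10.0, 1000.0]:
    A = np.array([[0.0, theta],
                  [0.0, 0.0]])
    print(theta, spectral_norm(A), spectral_radius(A))
    # The spectral norm grows like |theta| while the spectral radius stays 0.
```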
(1) Singular Value Decomposition (SVD)
The ideas of vector spaces and eigenvalues are useful in factorisations called "Singular Value Decompositions", which exist for all matrices.
Towards understanding SVD we define that an n×n matrix O is said to be orthogonal if OO^T = O^T O = I_{n×n}.
Next, we realise that if we have an m×n matrix X then XX^T and X^T X have only non-negative eigenvalues.
If w ≠ 0 is an eigenvector of XX^T with eigenvalue λ, then (XX^T)w = λw -> w^T(XX^T)w = ||X^T w||² = λ||w||² -> λ ≥ 0 - and similarly for X^T X.
Lastly, note that if XX^T w = λw and λ > 0, then X^T X (X^T w) = λ (X^T w), i.e. X^T w is an eigenvector of X^T X with the same eigenvalue λ. One could have made a similar argument starting from X^T X. Hence we know that XX^T and X^T X have the same set of positive eigenvalues.
Now we have all the tools to define SVD: any X ∈ R^(m×n) can be written as X = U Σ V^T, where U ∈ R^(m×m) and V ∈ R^(n×n) are orthogonal (the columns of U are eigenvectors of XX^T and the columns of V are eigenvectors of X^T X) and Σ ∈ R^(m×n) is diagonal with non-negative entries σ_i = sqrt(λ_i), the singular values.
Note that in general some of the columns of U and V can correspond to 0 eigenvectors of XX^T and X^T X respectively.
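A minimal numpy sketch checking these facts on a random matrix (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))                # arbitrary 4x6 matrix

U, s, Vt = np.linalg.svd(X)                # X = U @ Sigma @ Vt (full SVD)
Sigma = np.zeros((4, 6))
Sigma[:4, :4] = np.diag(s)

print(np.allclose(X, U @ Sigma @ Vt))      # True: reconstruction holds
print(np.allclose(U @ U.T, np.eye(4)))     # True: U is orthogonal
# The positive eigenvalues of X X^T equal the squared singular values.
eig = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]
print(np.allclose(eig, s**2))
```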
Introduction to Gradient Descent
What is gradient descent?
Gradient descent on a differentiable function F: R^p -> R is the iteration x_{t+1} = x_t − η_t ∇F(x_t), for step-lengths η_t > 0.
The key question - largely unresolved
"If F has global minima, then when does G.D. find one of them, which one, and how fast?"
Introduction to Proof of Convergence of Gradient Descent
Consider the case of a simple objective function F(x) = x^2
We can see the G.D. steps are x_{t+1} = x_t − η_t F'(x_t) = x_t − 2η_t x_t = (1 − 2η_t)x_t.
Let's unroll x_{t+1} = (1 − 2η_t)x_t to get x_{t+1} = x_0 ∏_{i=0}^{t} (1 − 2η_i).
Note that if we have x_0 = 0, then for all times t we would have x_t = 0, and the algorithm would not move at all. More generally, if the algorithm were to start at a point x_0 with F'(x_0) = 0 (a "critical point" of F), it would not move at all, so critical points are not to be used as starting points.
Proof of convergence of gradient descent on x^2
We want x_t -> 0 (the global minimum of F) as t -> ∞
From the recursion above it follows that a simple sufficient condition for that to happen is if for some constant k ∈ (0, 1), (1 − 2η_i) = k ∀i - and this can be ensured by choosing a constant step-length, η_t = ½(1 − k).
We then have (1 − 2η_i) = k ∀i, and we can rewrite what we previously had for x_{t+1} as x_{t+1} = k x_t = k^{t+1} x_0.
Thus we have |x_t| = k^t |x_0| -> 0, showing that for a range of choices of constant step-length, gradient descent converges on x^2 to its global minimum regardless of x_0.
But we can get more precise. To obtain convergence-time ("non-asymptotic") bounds, we check how long it takes to get within an ε > 0 interval of the global minimum: |x_T| ≤ ε once T ≥ log(|x_0|/ε) / log(1/k).
Thus we have arrived at the key fact that GD on this simple function can converge in a very short time - the number of steps needed grows only logarithmically in 1/ε.
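A minimal numpy sketch of this argument, using F(x) = x², a constant step-length η = (1 − k)/2, and an arbitrary starting point:

```python
import numpy as np

# Gradient descent on F(x) = x^2, with F'(x) = 2x.
k = 0.5                  # contraction factor in (0, 1)
eta = 0.5 * (1.0 - k)    # constant step-length, so (1 - 2*eta) = k
x = 10.0                 # arbitrary starting point x_0
eps = 1e-6

t = 0
while abs(x) > eps:
    x = x - eta * 2.0 * x      # x_{t+1} = (1 - 2*eta) * x_t = k * x_t
    t += 1

print(t, x)
# Predicted bound: T >= log(|x_0| / eps) / log(1 / k)
print(np.log(10.0 / eps) / np.log(1.0 / k))
```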
Principal Component Analysis
Background
An approach to exploratory data analysis, especially for high-dimensional data, and one of the earliest used in pattern recognition and ML. Widely used for data visualisation, data compression and feature extraction.
Solves the problem of finding linear orthonormal projections to maximise data variance in a new space.
Intuitively, finds a new "coordinate system" of maximum variance; the axes are named principal components (PCs).
Principle
Recap
A vector space is represented by a set of basis vectors u1, u2, ..., ud such that ∀ a ∈ R^d, a = a1 u1 + a2 u2 + ... + ad ud.
For any linear model used for representation learning, find a linear transformation (projection matrix) that carries out a pre-specified learning goal.
For dimension reduction, find p (p < d) basis vectors to form the projection matrix of the linear transformation.
Data Centralisation
For a dataset X = (X_ij)_{d×N} = {x1, x2, ..., xN}, we can get its mean vector m = (1/N) Σ_{n=1}^{N} x_n.
Data centralisation is done by x_n <- x_n − m, n = 1, 2, ..., N.
Principle: PCA Derivation
Find the 1st principal component
Given a dataset of N data points in a d-dimensional space
Centralise the data as above
To find the 1st PC of maximum variance, u, we need to establish a utility function encoding this learning goal, by associating the data with the parameter vector.
Let z_n denote the orthogonal projection of the centralised data onto the unit-length parameter vector u in the new vector space, i.e. z_n = (x_n − m)^T u.
The mean of the projected data is zero due to data centralisation.
On the u basis (axis), we estimate the variance of z with mean 0: Var(z) = (1/N) Σ_{n=1}^{N} z_n².
Inserting z_n = (x_n − m)^T u into this equation leads to Var(z) = u^T S u, where S = (1/N) Σ_{n=1}^{N} (x_n − m)(x_n − m)^T is the empirical covariance matrix of X.
Thus, the learning objective to find the 1st PC of maximum variance is formed: max_u u^T S u s.t. u^T u = 1.
Apply the Lagrange multiplier to form the unconstrained utility function: L(u, λ) = u^T S u + λ(1 − u^T u).
To find the optimal u, apply the optimality condition with respect to u: ∂L/∂u = 0 gives S u = λ u. Thus we know u must be an eigenvector of S with eigenvalue λ.
Apply our findings regarding u and λ to the variance of z: Var(z) = u^T S u = λ u^T u = λ. To maximise Var(z), we must choose the u* with the largest eigenvalue λ*.
The choice of u* leads to the 1st principal component.
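A minimal numpy sketch of this conclusion - the 1st PC is the top eigenvector of the empirical covariance matrix; the synthetic data and the 1/N covariance convention are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data with most variance along the direction (1, 1).
X = rng.normal(size=(2, 500)) * np.array([[3.0], [0.3]])
X = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2) @ X   # rotate by 45 degrees

m = X.mean(axis=1, keepdims=True)           # mean vector
Xc = X - m                                  # centralised data
S = (Xc @ Xc.T) / X.shape[1]                # empirical covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)        # ascending eigenvalues
u1 = eigvecs[:, -1]                         # 1st PC = top eigenvector
print(u1)                                   # close to +/- (1, 1)/sqrt(2)
print(eigvals[-1])                          # variance captured, close to 9
```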
PCA Formulation via Maximum Variance
We can find more PCs with the same idea, e.g. the 2nd PC learning objective is: max_u u^T S u s.t. u^T u = 1 and u^T u1 = 0.
In general, p (p < d) PCs u1, u2, ..., up are achieved by doing the eigen-analysis on the covariance matrix S and sorting the top p eigenvalues; the PCs are their corresponding eigenvectors.
Property: all p PCs are orthonormal, i.e. u_i^T u_j = δ(i, j) = 1 if i = j and 0 otherwise, and they act as a set of p axes to form a new "coordinate" system.
Therefore, we can use the PCA projection matrix P_{p×d} = {u1, u2, ..., up} to achieve a p-dimensional representation z_n = [z_n1, z_n2, ..., z_np]^T for a d-dimensional data point x_n: z_n = P(x_n − m), or z_ni = u_i^T (x_n − m), i = 1, 2, ..., p, where m is the mean of x.
PCA Formulation via Optimal Reconstruction
Motivation: from the p-dimensional representation z_n of a d-dimensional data point x_n in the PCA space (p < d), we can reconstruct the point in the original d-dimensional space (with some reconstruction error e_n): x̂_n = m + Σ_{i=1}^{p} z_ni u_i.
The total quadratic reconstruction error is measured by E = Σ_{n=1}^{N} ||x_n − x̂_n||², the sum of squared lengths of the blue lines shown in the original figure. Thus, we can formulate PCA via minimising the total reconstruction error.
From this perspective, we formulate the PCA learning objective: min_{W, Z} Σ_{n=1}^{N} ||x_n − m − W z_n||² s.t. W^T W = I, where Z_{p×N} = {z1, z2, ..., zN} is a collective notation for the low-dimensional representations of all data points in X, and W_{d×p} = {w1, w2, ..., wp} indicates the p basis vectors.
Minimising with respect to Z and W under the constraints leads to the same solution as the maximum-variance formulation: the optimal W consists of the top p eigenvectors of S, with z_n = W^T(x_n − m).
Property: on the training dataset X, the total reconstruction error E is equal to the sum of the eigenvalues of the unused PCs, E = Σ_{i=p+1}^{d} λ_i. Here u_i is the ith PC, i.e. the ith eigenvector based on the ranked eigenvalues of S.
Principle: Dual PCA
Problem: in reality, there are often more features than instances in X_{d×N} (d > N), which leads to a computational infeasibility problem in the eigen-analysis for PCA.
After data centralisation on X, its covariance matrix is S_{d×d} = (1/N) X X^T, a singular matrix (rank(S) ≤ N); i.e. there are at most N non-zero eigenvalues for S, but they are very hard to obtain due to the massive computation needed for a large d.
Its inner-product or Gram matrix is S'_{N×N} = X^T X, a non-singular matrix (rank(S') = N); its N non-zero eigenvalues are achievable much more efficiently from S'_{N×N}.
It can be proven that the dual matrices, S and S', share the same non-zero eigenvalues.
Solution: an eigenvector of S is achieved via the corresponding one of S': u_i ∝ X v_i (normalised to unit length), where u_i and v_i are the corresponding eigenvectors of S and S', and λ_i is their shared eigenvalue.
Principle: SVD Solution
Recap
Singular value decomposition section of 8 Foundational concepts
Solution to PCA
After data centralisation on X, its covariance matrix and inner-product matrix are S = (1/N) X X^T and S' = X^T X.
Writing the SVD of the centralised data matrix as X = U Σ V^T, we have X X^T = U (Σ Σ^T) U^T and X^T X = V (Σ^T Σ) V^T.
Therefore, we can utilise either U or V to produce the PCs: the columns of U are the eigenvectors of X X^T (so u_i is the ith column of U), or equivalently u_i ∝ X v_i with v_i the ith column of V.
Fact: SVD provides a solution to dual PCA directly from the centralised X instead of S or S'
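A short numpy sketch of this fact, assuming a synthetic d > N dataset; it checks that the top left singular vector of the centralised X gives the same direction as the top eigenvector of S:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 50, 10                       # more features than instances (d > N)
X = rng.normal(size=(d, N))
Xc = X - X.mean(axis=1, keepdims=True)       # centralised data

# Route 1: eigen-analysis of the d x d covariance matrix (expensive for large d).
S = (Xc @ Xc.T) / N
vals, vecs = np.linalg.eigh(S)
u_eig = vecs[:, -1]                          # top eigenvector of S

# Route 2: SVD of the centralised data matrix directly.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
u_svd = U[:, 0]                              # top left singular vector

# Same direction up to sign; the shared eigenvalue is s[0]^2 / N.
print(np.allclose(np.abs(u_eig @ u_svd), 1.0))
print(np.isclose(vals[-1], s[0]**2 / N))
```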
Algorithm: Basic PCA
Training phase: Find out PCA Projection matrix
Data centralisation
For a given dataset X_{d×N} (d < N), calculate the mean vector m = (1/N) Σ_{n=1}^{N} x_n and make a "mean matrix" M_{d×N} = {m, m, ..., m}. Then perform X <- X − M.
Eigen analysis
Calculate the covariance matrix S = (1/N) X X^T (on the centralised X). Find and rank all d eigenvalues so that λ1 ≥ λ2 ≥ ... ≥ λd, with their corresponding eigenvectors u1, u2, ..., ud.
Construct projection matrix
Select the top p (p < d) eigenvectors of S to be the PCs and form the projection matrix P_{d×p} = {u1, u2, ..., up}.
Deployment Phase: Encoding vs Decoding
Encoding: generate a low-dimensional representation of a data point: z = P^T(x − m); z is a p-dimensional representation of x.
Decoding: reconstruct the data point from its low-dimensional representation: x̂ = P z + m, an approximate d-dimensional version of x.
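A compact numpy sketch of the whole basic PCA algorithm (training, encoding, decoding) on synthetic data; the 1/N covariance convention and the toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, p = 5, 200, 2
X = rng.normal(size=(d, N)) * np.array([[5.0], [3.0], [1.0], [0.2], [0.1]])

# Training phase
m = X.mean(axis=1, keepdims=True)                 # mean vector, d x 1
Xc = X - m                                        # data centralisation
S = (Xc @ Xc.T) / N                               # covariance matrix
vals, vecs = np.linalg.eigh(S)                    # ascending eigenvalues
P = vecs[:, ::-1][:, :p]                          # top-p eigenvectors, d x p

# Deployment phase
Z = P.T @ (X - m)                                 # encoding: p x N
X_hat = P @ Z + m                                 # decoding: d x N reconstruction

# Average squared reconstruction error ~ sum of the discarded eigenvalues.
err = np.mean(np.sum((X - X_hat) ** 2, axis=0))
print(err, np.sum(vals[:-p]))
```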
Algorithm: Dual PCA
Training Phase: Find out PCA projection matrix
Data centralisation, as before.
SVD solution to eigen analysis
To use the V matrix, calculate Y_{N×d} = X^T (the transposed centralised data). With the SVD Y = U Σ V^T, V_{d×d} consists of the eigenvectors of Y^T Y = X X^T, i.e. of the (unnormalised) covariance matrix.
Construct projection matrix
Select top p (p<=N) eigenvectors of S to be PCs to form the projection matrix like before.
Deployment Phase
also the same as before
Algorithm: Proportion of Variance
How do we find a proper dimension, p*, to form the PCA space?
Define the proportion of variance PoV(k) = (Σ_{i=1}^{k} λ_i) / (Σ_{i=1}^{d} λ_i) for k < d.
In practice, find the smallest k* so that PoV(k*) ≥ 90% and set p* = k*.
PoV(k) also works for the dual PCA on X_{d×N} (d ≥ N), although there are only up to N non-zero eigenvalues and at least d − N zero eigenvalues. In this case, only check k ≤ N to find p*.
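A tiny sketch of the PoV rule, reusing eigenvalues of a covariance matrix; the example eigenvalues are made up and the 90% threshold is the one stated above:

```python
import numpy as np

def choose_dimension(eigenvalues, threshold=0.90):
    """Return the smallest k with PoV(k) >= threshold.

    eigenvalues: 1-D array sorted in descending order.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    pov = np.cumsum(lam) / np.sum(lam)
    return int(np.argmax(pov >= threshold)) + 1   # +1: k is 1-indexed

# Example with made-up eigenvalues.
lam = np.array([25.0, 9.0, 4.0, 1.0, 0.5, 0.5])
print(choose_dimension(lam))   # PoV(2) = 85%, PoV(3) = 95% -> p* = 3
```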
Applications
PCA Application to visualisation of microarray data
PCA application to data compression for reconstructing numbers
Eigenface for data compression and feature extraction of facial images
Limitations of original PCA
Non-orthogonality
Data distributions may not satisfy the PCA assumption of orthogonal dimensions. We can use independent component analysis instead, which captures the maximum variance in non-orthogonal dimensions.
Inconsistency with discrimination
In classification, the axis of maximum variance may be inconsistent with the axis of largest discrimination. Linear discriminant analysis finds the axis of largest discrimination by considering the label information.
Non-linearity
Data may not be distributed in a linear subspace, while PCA can only cope with a linear manifold. Extended PCA techniques and manifold learning can deal with nonlinear manifolds for low-dimensional representations.
Extension of PCA
Logistic PCA: extension to allow dealing with binary and categorical input features.
Sparse PCA: overcomes the weakness that PCs are usually a linear combination of all input features, by using just a few variables.
Non-linear PCAs: e.g. principal curve analysis, to deal with non-linear manifolds.
Robust PCAs: overcome the weakness that outliers cause large errors, e.g. weighted PCA.
Probabilistic PCA: a stochastic variant (generative model) that explains the PCA process from a data-generation perspective.
Canonical Correlation Analysis
Background
An approach to exploratory data analysis, especially for two relevant random vectors, e.g. the visual and audio channels in video.
Widely used for feature extraction and data interpretation.
Goal: find optimal linear projections for two relevant random vectors to maximise their correlation in the new spaces.
Provides new low-dimensional representations of the two random vectors, achieved by projecting data points into the new CCA spaces.
Principle: CCA Derivation
Find out the 1st pair of canonical vectors, a and b
Given a dataset of N data points (X, Y), where X_{p×N} = {x1, x2, ..., xN} and Y_{q×N} = {y1, y2, ..., yN} are centralised, respectively.
To find the 1st pair of canonical vectors, a and b, of maximum correlation, we need to establish a utility function by encoding this learning goal.
Projecting X and Y onto a and b leads to z_X = X^T a and z_Y = Y^T b.
The correlation between the two vectors z_X and z_Y is correlation(z_X, z_Y) = (a^T S_XY b) / sqrt((a^T S_XX a)(b^T S_YY b)), where S_XX, S_YY and S_XY are covariance matrices on X, Y and (X, Y), i.e. S_XX = (1/N) X X^T, S_YY = (1/N) Y Y^T and S_XY = (1/N) X Y^T.
The denominators in correlation(z_X, z_Y) play the role of normalisation, making the correlation invariant with respect to rescaling of a and b.
Thus, the learning objective to find the 1st pair of a and b for maximum correlation is formulated along with two normalisation constraints: max_{a,b} a^T S_XY b s.t. a^T S_XX a = 1 and b^T S_YY b = 1.
To find the optimal a and b, apply the optimality conditions (from the Lagrangian of this constrained problem): S_XY b = λ_a S_XX a and S_YX a = λ_b S_YY b.
Multiplying a^T and b^T on both sides of the equations and applying the two constraints, we have that a^T S_XY b = λ_a and b^T S_YX a = λ_b are scalars and equal to each other. Therefore, we must have λ_a = λ_b = λ, the canonical correlation coefficient.
By using the common λ, we rewrite the above equations as S_XY b = λ S_XX a and S_YX a = λ S_YY b.
Assuming S_XX and S_YY are invertible, we rewrite the last equation as b = (1/λ) S_YY^{-1} S_YX a and insert it into the first equation: S_XX^{-1} S_XY S_YY^{-1} S_YX a = λ² a.
With eigen-analysis on the last equation, choose the a* with the largest λ*², and set b* = (1/λ*) S_YY^{-1} S_YX a*.
The choice of a* and b* leads to the 1st pair of canonical vectors, a and b.
CCA Formulation
We can find more pairs of cannonical vectors with the same idea, e.g. the learning objectives of the 2nd pair of
a
and
b
as follows:
In general, M (M <= min(p,q)) pairs of canonical vectors, A = {
a
1, ...,
a
M} and B = {
b
1, ...,
b
M} are achieved by doing the eigen analysis on the composite matrix
, to form A
(pxM) with the eigenvectors
a
1, ...,
a
M of top M eigenvalues and B
(qxM) with
Property
: all M canonical vectors in A and B are
uncorrelational
, i.e.
if i=j and 0 otherwise.
For dataset (X, Y), we can use CCA projection matrices A
(pxM) and B
(qxM) to achieve the
M-dimensional representations
:
Algorithm: Basic CCA
Training Phase: Find out CCA projection matrices
Data centralisation
Like before, but with a mean matrix for both X and Y.
Eigen analysis
Calculate the covariance matrices S_XX, S_YY and S_XY (with S_YX = S_XY^T). For the composite matrix S_XX^{-1} S_XY S_YY^{-1} S_YX, find and rank all p eigenvalues so that λ1² ≥ λ2² ≥ ... ≥ λp², with their corresponding eigenvectors a1, ..., ap. Then, generate the paired eigenvectors via b_i = (1/λ_i) S_YY^{-1} S_YX a_i.
Construct projection matrices
Select the top M eigenvectors to form the paired projection matrices A_{p×M} = {a1, ..., aM} and B_{q×M} = {b1, ..., bM}.
Deployment Phase: generate low-dimensional representations in CCA space
z_X = A^T(x − m_X) and z_Y = B^T(y − m_Y).
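A small numpy sketch of this procedure on synthetic paired data; the exact covariance normalisation, the tiny ridge added for invertibility, and the data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 500, 4, 3
shared = rng.normal(size=(1, N))                  # hidden common signal
X = np.vstack([shared, rng.normal(size=(p - 1, N))])
Y = np.vstack([2.0 * shared, rng.normal(size=(q - 1, N))])

# Data centralisation
X = X - X.mean(axis=1, keepdims=True)
Y = Y - Y.mean(axis=1, keepdims=True)

# Covariance matrices (with a tiny ridge for invertibility)
Sxx = X @ X.T / N + 1e-8 * np.eye(p)
Syy = Y @ Y.T / N + 1e-8 * np.eye(q)
Sxy = X @ Y.T / N

# Eigen-analysis of the composite matrix Sxx^-1 Sxy Syy^-1 Syx
C = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
vals, vecs = np.linalg.eig(C)
order = np.argsort(-vals.real)
lam = np.sqrt(vals.real[order[0]])                # 1st canonical correlation
a = vecs.real[:, order[0]]
b = np.linalg.solve(Syy, Sxy.T @ a) / lam

zx, zy = a @ X, b @ Y
print(lam, np.corrcoef(zx, zy)[0, 1])             # both close to each other
```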
Algorithm: SVD-Based CCA
Training Phase: Find out CCA projection matrices
Data centralisation
As before, for both X and Y.
SVD solution to eigen analysis
Calculate the covariance matrices S_XX, S_YY and S_XY. Apply SVD to the whitened cross-covariance matrix K = S_XX^{-1/2} S_XY S_YY^{-1/2} = U Σ V^T, where U_{p×p} and V_{q×q} are orthogonal and the diagonal of Σ holds the canonical correlations.
Construct projection matrices
Select the top M singular vectors to form the paired projection matrices A_{p×M} and B_{q×M}, where A = S_XX^{-1/2} U_M and B = S_YY^{-1/2} V_M.
Deployment Phase: Generate low-dimensional representations in CCA space
As before
Algorithm: Proportion of Variance
As before except
In practice, find the smallest k so that PoV(k) ≥ 90%, and then set M* = k.
In CCA, PoV(k) works over the up to min(p, q) non-zero eigenvalues, which are obtained in squared form (λ²) by the basic or the SVD-based CCA algorithms.
Manifold Learning
Introduction
Manifold: a research area in the mathematics of topology and differential geometry. A d-dimensional space is said to be a manifold iff at each point there exists a neighbourhood that is homeomorphic to d-dimensional Euclidean space, R^d.
Manifold hypothesis: real-world high-dimensional data often lie on low-dimensional (sub)manifolds embedded in the high-dimensional space.
Manifold learning: discover and model low-dimensional manifolds by learning from data, to form a low-dimensional latent embedding representation.
Rigid versus non-rigid geometry
Rigid shapes are invariant in Euclidean space, while inelastic non-rigid shapes are variant in Euclidean (extrinsic) space.
Extrinsic Euclidean distance does not work for non-rigid shapes.
There is an intrinsic space for inelastic non-rigid shapes (a non-linear manifold) where the intrinsic distance is invariant for any data points on the non-linear manifold.
Manifold learning maps the non-rigid shapes to a latent embedding space where extrinsic distance works, preserving the intrinsic distance.
Illustrative examples
Visual perception - images with the same properties are located in low-dimensional manifolds.
This needs to be captured in computer vision - many intrinsic properties underlie manifolds in computer vision. It accounts for appearance variation and shape deformation, and also allows missing frames on the manifold to be interpolated.
Multi-dimensional Scaling (MDS)
Background
A collection of dimension reduction techniques that map the distances between observations in a high-dimensional source space into a low-dimensional target space.
Learn a configuration of points in the target space whose inter-point distances preserve the distances between the corresponding points in the source space.
Models the intrinsic manifold for low-dimensional representation and visualisation.
Problem Formulation
Given a distance matrix in the d-dimensional source space, ∆ = (δ_ij)_{N×N}, we have δ_ii = 0, δ_ij > 0 and δ_ij = δ_ji. (MDS doesn't need to know the dimensionality or the distance metric of the source space.)
Given a p-dimensional (Euclidean) target space (p < d), the distance between two data points z_i and z_j is measured by d_ij = ||z_i − z_j||_2.
Problem: find N data points Z_{p×N} = {z1, ..., zN} in the p-dimensional space such that d_ij ≈ f(δ_ij), where f(·) is a parametric monotonic increasing/decreasing function, e.g. f(δ_ij) = α + βδ_ij.
To ensure the solution is unique, impose the constraint Σ_{i=1}^{N} z_i = 0 (the mean point must be zero).
Most MDS learning algorithms choose f(δ_ij) = δ_ij, so that d_ij ≈ δ_ij.
Algorithm: Classical MDS (cMDS)
In the d-dimensional (Euclidean) source space, the distance matrix comes from an unknown centralised data matrix of N points, X_{d×N}, where δ²_ij = ||x_i − x_j||² = (x_i − x_j)^T(x_i − x_j).
For a p-dimensional (Euclidean) target space (p < d), the distance between two data points z_i and z_j is measured by d²_ij = ||z_i − z_j||² = (z_i − z_j)^T(z_i − z_j).
Problem: find N data points Z_{p×N} in the p-dimensional space to minimise the disparity between the d_ij and the δ_ij.
A solution is achieved by means of Euclidean vector space properties.
For a centralised data matrix X, its inner-product (Gram) matrix G_{N×N} = X^T X is expressible via its distance matrix: G = −½ H ∆² H, where H = I_N − (1/N) e e^T, ∆² = (δ²_ij)_{N×N}, I_N is the identity matrix and e = [1 1 ... 1]^T.
In element-wise notation, G_ij = −½ (δ²_ij − (1/N) Σ_k δ²_ik − (1/N) Σ_k δ²_kj + (1/N²) Σ_{k,l} δ²_kl).
By means of the Euclidean vector space property, we also do the same in the target space: Z^T Z = −½ H D² H, where D² = (d²_ij)_{N×N}.
Thus, the disparity loss can be rewritten as the minimisation of a reconstruction error: min_Z ||Z^T Z − G||²_F.
Solution: conduct spectral decomposition of G = X^T X: G_{N×N} = V_{N×N} Σ_{N×N} V^T_{N×N} = Σ_{i=1}^{N} λ_i v_i v_i^T, where v_i is the ith eigenvector of G corresponding to the eigenvalue λ_i, and λ1 ≥ λ2 ≥ ... ≥ λN.
Produce the p-dimensional optimal coordinates (p < d) by setting Z = Σ_p^{1/2} V_p^T, where Σ_p is the diagonal matrix of the top p eigenvalues and V_p is the matrix of the corresponding N-dimensional eigenvectors, v1, ..., vp. In element-wise notation: z_ij = sqrt(λ_i) (v_i)_j, i = 1, ..., p, j = 1, ..., N.
Connection to PCA: when X is known, projecting the N points onto the top p PCs obtained from its covariance matrix leads to the same embedding results (configuration) as cMDS. PoV is also applicable to cMDS for choosing a proper p.
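A compact numpy sketch of cMDS from a distance matrix, using the double-centring formula above; the toy data are an assumption:

```python
import numpy as np

def classical_mds(Delta, p):
    """Classical MDS sketch: embed an N x N distance matrix into R^p."""
    N = Delta.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N          # centring matrix
    G = -0.5 * H @ (Delta ** 2) @ H              # Gram matrix from distances
    vals, vecs = np.linalg.eigh(G)               # ascending eigenvalues
    vals, vecs = vals[::-1][:p], vecs[:, ::-1][:, :p]
    return (vecs * np.sqrt(np.maximum(vals, 0.0))).T   # p x N coordinates

# Check on points whose pairwise Euclidean distances we know.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 6))                      # 6 points in R^2
Delta = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
Z = classical_mds(Delta, p=2)
D = np.linalg.norm(Z[:, :, None] - Z[:, None, :], axis=0)
print(np.allclose(D, Delta))                     # distances are preserved
```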
Algorithm: Stress-based MDS
In MDS, many different stress (loss) functions have been proposed to carry out d_ij ≈ f(δ_ij).
Commonly used stress (loss) functions, all with respect to Z_{p×N}: L_ee, L_ff and L_ef.
L_ee: penalises large absolute errors.
L_ff: penalises large relative errors.
L_ef (aka Sammon mapping when f(δ_ij) = δ_ij): a trade-off between L_ee and L_ff.
Solution: apply an optimisation method to minimise the loss function w.r.t. Z.
There is no analytic solution, but an iterative method, e.g. stochastic gradient descent (SGD), can be used:
1. Initialise the p-dimensional embedded coordinates randomly: z^0_1, ..., z^0_N.
2. Update each coordinate by a gradient step on the chosen stress, z_k <- z_k − η ∂L/∂z_k, where the gradient involves the current distances d_kj = ||z_k − z_j||.
3. Repeat step 2 until a stopping condition is satisfied.
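A minimal numpy sketch of this iterative scheme for the simplest case, the raw stress Σ_{i<j} (d_ij − δ_ij)² with f(δ) = δ; the step-size, iteration count, toy data and the use of full-batch gradient steps are assumptions:

```python
import numpy as np

def stress_mds(Delta, p=2, lr=0.01, iters=2000, seed=0):
    """Gradient-descent sketch for the raw stress sum_{i<j} (d_ij - delta_ij)^2."""
    rng = np.random.default_rng(seed)
    N = Delta.shape[0]
    Z = rng.normal(size=(p, N))                  # random initial coordinates
    for _ in range(iters):
        diff = Z[:, :, None] - Z[:, None, :]     # p x N x N pairwise differences
        D = np.linalg.norm(diff, axis=0) + np.eye(N)   # avoid divide-by-zero on diagonal
        # dL/dz_k = sum_j 2 * (d_kj - delta_kj) * (z_k - z_j) / d_kj
        coef = 2.0 * (D - Delta) / D
        np.fill_diagonal(coef, 0.0)
        grad = (coef[None, :, :] * diff).sum(axis=2)
        Z -= lr * grad
    return Z

# Toy check: recover a configuration for 5 points from their distance matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 5))
Delta = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
Z = stress_mds(Delta)
D = np.linalg.norm(Z[:, :, None] - Z[:, None, :], axis=0)
print(np.abs(D - Delta).max())        # residual disparity after optimisation
```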
Extension
Non-metric MDS (nMDS): dissimilarities are known only by their rank order, e.g. ordered alphabetically.
Weighted MDS: stress-based metric MDS is extensible to highlight certain aspects of the dissimilarities in the data.
Symbolic MDS: MDS is extensible to symbolic data, using intervals of dissimilarities and distributions of dissimilarity instead of distances between two points.
Interactive MDS: an extension of MDS that allows a new, intuitive interaction with the MDS user on the interplay of specific dissimilarities in the data.
Isometric Feature Mapping (ISOMAP)
Background
Motivation: tackle a fundamental problem in manifold learning, i.e. given high-dimensional data sampled from an unknown low-dimensional manifold, how can we automatically recover a good embedding?
Linear models, e.g. PCA, can only recover a linear manifold; they do not work for non-linear manifolds because Euclidean distance metrics do not respect the geometry of non-linear manifolds.
Solution: find a proper distance metric that fits the geometry of the nonlinear manifold. The homeomorphic property of a manifold allows an approximation of the proper geodesic distance via the use of Euclidean distance locally.
Principle
Problem: how to find a transformation that preserves the geodesic distances between high-dimensional data points in a low-dimensional Euclidean space?
Key idea: ISOMAP tackles this problem by
Finding an approximation to the geodesic distances between any points in high-dimensional space, to establish a geodesic distance matrix
Applying cMDS to the geodesic distance matrix to achieve embedded coordinates in a low-dimensional space where the Euclidean distances between points are close to their corresponding geodesic distances.
Approximation to geodesic distance for neighbouring data points
Determine neighbours based on the Euclidean distances between points in the dataset (ϵ-neighbourhood or K-NN).
With the neighbourhood information, construct a weighted graph in which edges (with weights) exist only between a point and its neighbouring points.
Approximation to geodesic distance for distant data points
Approximate the geodesic distances between all pairs of non-connected points (those without edges) by estimating their shortest-path distances in the weighted graph.
Finding the shortest-path distance in a weighted graph is a classical optimisation problem in graph theory (e.g. Dijkstra's algorithm).
Apply cMDS for low-dimensional embedding
Based on the approximated geodesic distances between the data points in the dataset, form a geodesic distance matrix in the high-dimensional source space.
Choose a proper dimension for the low-dimensional target space, and apply cMDS to the geodesic distance matrix to obtain the embedded coordinates of the data points in low-dimensional Euclidean space.
Algorithm
Construct neighbourhood graph
Based on the Euclidean distance matrix in the source space, ∆_X = (δ_X(i, j))_{N×N}, set up the graph G by connecting points i and j if δ_X(i, j) ≤ ϵ (ϵ-ISOMAP) or if point i is one of the K nearest neighbours of point j (K-ISOMAP). Set edge lengths equal to δ_X(i, j).
Compute shortest paths
Initialise δ_G(i, j) = δ_X(i, j) if i and j are linked by an edge, or δ_G(i, j) = ∞ otherwise.
For k = 1, ..., N, replace all entries δ_G(i, j) by min{δ_G(i, j), δ_G(i, k) + δ_G(k, j)}.
Form the geodesic distance matrix in the source space: ∆_G = (δ_G(i, j))_{N×N}.
Construct p-dimensional embedded coordinates with cMDS
Convert the geodesic distance matrix into its corresponding Gram matrix, G (via double centring, as in cMDS).
Conduct spectral decomposition: G_{N×N} = V_{N×N} Σ_{N×N} V^T_{N×N} = Σ^N_{i=1} λ_i v_i v_i^T.
Produce the p-dimensional optimal coordinates (p < d) by setting Z = Σ_p^{1/2} V_p^T, where Σ_p is the diagonal matrix of the top p eigenvalues and V_p is the matrix of the corresponding N-dimensional eigenvectors, v_1, ..., v_p.
In element-wise notation: z_ij = sqrt(λ_i) (v_i)_j, i = 1, ..., p; j = 1, ..., N.
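A self-contained numpy sketch of these three steps (K-NN graph, shortest paths via the min-plus update, then cMDS); the toy helix dataset and the hyperparameters are assumptions:

```python
import numpy as np

def isomap(X, n_neighbors=6, p=1):
    """K-ISOMAP sketch: K-NN graph, shortest-path geodesics, classical MDS."""
    N = X.shape[1]
    delta = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)

    # Step 1: neighbourhood graph (K-NN, symmetrised), edge lengths = Euclidean.
    geo = np.full((N, N), np.inf)
    np.fill_diagonal(geo, 0.0)
    for i in range(N):
        nbrs = np.argsort(delta[i])[1:n_neighbors + 1]
        geo[i, nbrs] = delta[i, nbrs]
        geo[nbrs, i] = delta[i, nbrs]

    # Step 2: shortest-path (geodesic) distances via the min-plus update.
    for k in range(N):
        geo = np.minimum(geo, geo[:, [k]] + geo[[k], :])

    # Step 3: classical MDS on the geodesic distance matrix.
    H = np.eye(N) - np.ones((N, N)) / N
    G = -0.5 * H @ (geo ** 2) @ H
    vals, vecs = np.linalg.eigh(G)
    vals, vecs = vals[::-1][:p], vecs[:, ::-1][:, :p]
    return (vecs * np.sqrt(np.maximum(vals, 0.0))).T     # p x N coordinates

# Toy check: a 1-D helix in R^3; the 1-D embedding should track the parameter t.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 4 * np.pi, 200))
X = np.vstack([np.cos(t), np.sin(t), 0.3 * t])
Z = isomap(X, n_neighbors=6, p=1)
print(np.abs(np.corrcoef(Z[0], t)[0, 1]))   # close to 1
```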
Relevant Issue
Limitation
To work, sufficient data points have to be sampled smoothly from the manifold.
Works only on convex Euclidean manifolds to recover the intrinsic geometry.
Out-of-sample extension
ISOMAP does not provide any mapping function z = f(x).
For extension to unseen data, we can use the known raw data and their embedded coordinates as training examples to learn a parametric mapping function z = f(Θ, x).
Extension
Conformal ISOMAP: relaxes the manifold assumption by preserving manifold orientation instead of geodesic distance.
C-Isomap: allows magnifying regions of high density while shrinking regions of low density.
Incremental ISOMAP: allows online ISOMAP learning by embedding points one by one instead of training in a batch manner.
Landmark ISOMAP: overcomes the high computational burden of learning by using landmarks, only a subset of representative data points.
Robust ISOMAP: replaces Dijkstra-path-based geodesic distance estimates with a parallel transport unfolding approximation, for robustness to noise.
Locally Linear Embedding
Background
Motivation: tackle the same fundamental problem that ISOMAP encounters in manifold learning, i.e. given high-dimensional data sampled from an unknown low-dimensional manifold, how to recover a good embedding automatically?
To model the intrinsic nonlinear manifold for low-dimensional representation and visualisation.
Solution: find a proper distance metric that fits the geometry of the nonlinear manifold. This is difficult for an arbitrary non-linear manifold. The homeomorphic property of a manifold allows an approximation of the proper geodesic distance via the use of Euclidean distance locally.
LLE adopts a key idea, think globally, fit locally, to learn an embedding that preserves geodesic distance.
Principle
Problem: how to find a transformation that preserves the geodesic distances between high-dimensional data points in a low-dimensional Euclidean space?
Key idea: unlike ISOMAP, LLE tackles this problem by two-stage learning.
Stage 1: learn to reconstruct each data point from its neighbours via a linear parametric model; the learned parameters encode the local geometric information in the high-dimensional data space, and such information can be propagated via the neighbourhood connections of different points to approximate geodesic distances globally.
Stage 2: learn the embedding coordinates by reconstructing each data point from its neighbouring points linearly in the low-dimensional representation space, with the parameters learned in Stage 1, so that the nonlinear manifold is embedded properly.
Main steps in LLE
Select neighbours (K-NN or ϵ-neighbourhood)
Stage-1 learning (reconstruct with linear weights)
Given X_{d×N} = {x1, ..., xN}, learn the optimal linear weights W*_{N×N} = {w*_1, ..., w*_N}.
Loss function: E(W) = Σ_{n=1}^{N} ||x_n − Σ_j W_nj x_j||², where W_nj = 0 if point j is not a neighbour of point n. For any x ∈ X with neighbours x_1, ..., x_K, the per-point loss is ||x − Σ_{j=1}^{K} w_j x_j||².
Thus the learning objective is to minimise the constrained loss: min_w ||x − Σ_{j=1}^{K} w_j x_j||² s.t. Σ_{j=1}^{K} w_j = 1.
Convert the learning objective into an unconstrained loss with a Lagrange multiplier: L(w, λ; X) = ||x − Σ_{j=1}^{K} w_j x_j||² + λ(1 − Σ_{j=1}^{K} w_j).
Optimisation: minimise L(w, λ; X) to find the optimal parameters w* = [w*_1, ..., w*_K]^T, giving w*_j = Σ_k C^{-1}_jk / Σ_{l,m} C^{-1}_lm, where C^{-1} = (C^{-1}_jk)_{K×K} is the inverse of the local Gram matrix C_jk = (x − x_j)^T(x − x_k).
Stage-2 learning (map to embedded coordinates)
With the optimal linear weights W*_{N×N}, learn a mapping to obtain the optimal embedded coordinates Z*_{p×N} (p < d).
Loss function: E(Z) = Σ_{n=1}^{N} ||z_n − Σ_j W*_nj z_j||².
To facilitate optimisation, we can reformulate the loss function in a quadratic form: E(Z) = tr(Z M Z^T), where it can be shown that M = (I_N − W*)^T(I_N − W*), which also allows the matrix M to be stored efficiently.
Thus, the learning objective is to minimise the constrained loss: min_Z tr(Z M Z^T) s.t. (1/N) Z Z^T = I_p and Σ_n z_n = 0.
Optimisation: this is a standard form of quadratic programming problem. The solution requires an eigen-analysis of the matrix M.
Construct p-dimensional embedding coordinates
Let λ1 ≤ λ2 ≤ ... ≤ λN be the eigenvalues of M and v1, ..., vN be their corresponding eigenvectors.
Produce the p-dimensional embedded coordinates by setting z*_i = v_{i+1}, i.e. z*_ij = (v_{i+1})_j.
Fact: the optimal embedding needs the bottom p + 1 eigenvectors of the matrix M, discarding the bottom eigenvector with λ_1 = 0 and v_1 = [1 1 ... 1]^T.
Algorithm
Select neighbours for each data point
Determine neighbours using the Euclidean distances between data points in the high-dimensional space, via ϵ-neighbourhood or K-NN.
Stage-1 learning: reconstruct with linear weights
For each data point x ∈ X, use only its neighbours x_1, ..., x_K to compute C = (C_jk)_{K×K}, where C_jk = (x − x_j)^T(x − x_k), and its inverse C^{-1} = (C^{-1}_jk)_{K×K}.
Compute the optimal Lagrange multiplier: λ* = 2 / Σ_{l,m} C^{-1}_lm.
Compute the optimal linear weights: w*_j = Σ_k C^{-1}_jk / Σ_{l,m} C^{-1}_lm.
Stage-2 learning: map to embedded coordinates
Conduct eigen-analysis on M = (I − W*)^T(I − W*) and rank the eigenvalues λ1 ≤ ... ≤ λN, with their corresponding eigenvectors v1, ..., vN.
Produce the p-dimensional optimal embedded coordinates by setting z*_i = v_{i+1}, i.e. z*_ij = (v_{i+1})_j, i = 1, ..., p; j = 1, ..., N.
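A self-contained numpy sketch of both stages (K-NN neighbours, closed-form weights with a small regulariser when K > d, then the bottom eigenvectors of M); the toy helix dataset and hyperparameters are assumptions:

```python
import numpy as np

def lle(X, n_neighbors=8, p=2, reg=1e-3):
    """Locally Linear Embedding sketch (K-NN neighbours, closed-form weights)."""
    d, N = X.shape
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    W = np.zeros((N, N))

    # Stage 1: reconstruct each point from its neighbours with linear weights.
    for n in range(N):
        nbrs = np.argsort(dist[n])[1:n_neighbors + 1]
        diff = X[:, [n]] - X[:, nbrs]                 # d x K
        C = diff.T @ diff                             # local Gram matrix, K x K
        C += reg * np.trace(C) * np.eye(n_neighbors)  # regularise if K > d
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[n, nbrs] = w / w.sum()                      # enforce sum-to-one

    # Stage 2: embedding from the bottom eigenvectors of M = (I - W)^T (I - W).
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)                    # ascending eigenvalues
    return vecs[:, 1:p + 1].T                         # discard the constant eigenvector

# Toy check on a 1-D helix in R^3: the embedding should track the parameter t.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 4 * np.pi, 300))
X = np.vstack([np.cos(t), np.sin(t), 0.3 * t])
Z = lle(X, n_neighbors=8, p=1)
print(np.abs(np.corrcoef(Z[0], t)[0, 1]))             # reasonably close to 1
```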
Relevant Issue
Limitations
To work, data points have to be sampled uniformly from the manifold.
Sensitive to noisy sampling and to the hyperparameter K (or ϵ).
Unlike ISOMAP
LLE does not make any assumption on the manifold, but may not be able to recover complex non-convex manifolds.
There is no robust way to decide the intrinsic dimension of the embedding space.
Out-of-sample extension
Like ISOMAP (and MDS), LLE does not provide any mapping function.
For extension to unseen data, however, we can use the raw data and their embedded coordinates as training examples to learn a parametric mapping function, e.g. a support vector regressor or a DNN.
Extension
Hessian LLE: replaces the linear reconstruction with Hessian operators at each data point in Stage-1 learning, for robustness.
Modified LLE: uses multiple weight vectors in each neighbourhood to remedy distorted LLE maps.
Diffusion Map: replaces linear reconstruction with the Laplacian of a graph of local similarities at different scales, for a global description of the data.
Local Tangent Space Alignment: inspired by LLE, carries out the intuition that when a manifold is correctly unfolded, all of the tangent hyperplanes to the manifold become aligned.
Basics of Neural Nets
Activation Gates of the Neural Network
The building blocks of a "neural net" are its activation gates, which do the basic analogue computations.
The gate shown in the original figure is an R^3 -> R neural gate evaluating the "activation" function f: R -> R on a linear transformation of the input vector, i.e. (x1, x2, x3) ↦ f(w1 x1 + w2 x2 + w3 x3). The w's are the "weights".
The ReLU activation function
For most purposes the best activation function to use is the "Rectified Linear Unit (ReLU)", ReLU(x) = max{0, x}.
And one affine layer (say A) of the net corresponds to the map x ↦ Wx + b, for some matrix W and a vector b.
Example of a One Layer ReLU Net
Including a ReLU activation, one defines a layer of a neural net as x ↦ max{0, Wx + b}. Working through this, the max{0, ·} acts on each value in the vector (coordinate-wise).
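A tiny numpy sketch of one such layer; the particular weights and input are made up:

```python
import numpy as np

def relu_layer(W, b, x):
    """One neural-net layer: coordinate-wise max{0, W x + b}."""
    return np.maximum(0.0, W @ x + b)

W = np.array([[1.0, -2.0, 0.5],
              [0.0,  1.0, 1.0]])    # made-up 2x3 weight matrix
b = np.array([-0.5, -1.0])
x = np.array([1.0, 1.0, 2.0])

print(relu_layer(W, b, x))          # negative coordinates are clipped to 0
```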
Defining Neural Nets
The key component of a neural net is the activation function, with the most widely used being ReLU.
What is a neural net?
Given a set of k+1 affine transformations A_1, ..., A_{k+1}, it defines a depth-(k+1) "ReLU Deep Neural Net" as the function x ↦ A_{k+1}(ReLU(A_k(... ReLU(A_1(x)) ...))), i.e. affine maps interleaved with coordinate-wise ReLUs.
Basic Setup of Neural Net Objectives
For a data distribution D, a finite set of samples S from it, a loss function L, and a neural net N, we define its Population Risk, R, and Empirical Risk, R̂, as R(N) = E_{(x,y)~D}[L(N(x), y)] and R̂(N) = (1/|S|) Σ_{(x,y)∈S} L(N(x), y).
We view the above two as real-valued functions which compute the different risks as N varies across the different functions realisable on the architecture (as the weights, w, change). Then we can write R(N) as R(w).
For the most basic kind of depth-2 ReLU nets, consider the population risk given by the standard squared loss implemented on the neural-net mapping, R(w) = E_{(x,y)~D}[ ||y − N_w(x)||²_2 ].
There are special cases of the above expectation where it can be exactly evaluated. Neural nets whose input and output are of the same dimension are called autoencoders.
Intro to Autoencoders
Sparse-coding: an instructive setup with Autoencoders
The defining equation of our autoencoder, computing the reconstruction ŷ from the input y:
The generative model: a sparse x* ∈ R^h and y = A x* ∈ R^n, with h > n.
Define the code h := ReLU(W y − ε) = max{0, W y − ε} ∈ R^h, for encoder weights W ∈ R^(h×n) and a bias vector ε.
Then the autoencoder's output is ŷ = W^T h = W^T ReLU(W y − ε), as we try to see how well we can reconstruct the input.
And a typical loss function to be minimised would be E_y ||y − W^T ReLU(W y − ε)||²_2 plus a regularisation term, where the second half is included so we don't bias the algorithm towards the low-norm part of the search space.
Sparse-coding by Autoencoders is an example of Generative Modelling
The typical autoencoder loss function can be re-written as an expectation over the generative model, i.e. over y = A x*.
Notice that in the above, W^T is of the same dimensions as A. Now if, under certain special conditions, we can minimise the above loss function and it turns out that the minimising W^T is very close to A - then we would have approximately figured out how the inputs y to the autoencoder have been generated from the simpler sparse vectors / "source codes" x*.
Hence, this is a simple example of generative modelling
Building more intuition about the autoencoding loss function
One special case where there is a known correct global minimum: we observe that the standard squared loss, ||y − W^T ReLU(W y)||²_2, evaluates to its minimum possible value of 0 at W^T = A in the special case where y = Ax such that A is orthogonal and x ≥ 0.
But even in this case, we do not currently know how to prove that an optimisation algorithm will actually recover A from given samples of y, which is all that the loss function has access to.
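A short numpy sketch of this special case, with an assumed random orthogonal A and non-negative sparse codes; it only verifies that the loss is (near) zero at W = A^T, not that any algorithm finds that point:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Random orthogonal dictionary A (square n x n, matching the special case above).
A, _ = np.linalg.qr(rng.normal(size=(n, n)))

def sample_y(batch):
    """Generative model: sparse non-negative codes x, observations y = A x."""
    x = np.maximum(0.0, rng.normal(size=(n, batch)))   # non-negative
    x *= rng.random(size=(n, batch)) < 0.3             # sparse support
    return A @ x

def loss(W, Y):
    """Empirical version of E || y - W^T ReLU(W y) ||^2."""
    recon = W.T @ np.maximum(0.0, W @ Y)
    return np.mean(np.sum((Y - recon) ** 2, axis=0))

Y = sample_y(1000)
print(loss(A.T, Y))                      # ~0 at the "planted" solution W = A^T
print(loss(rng.normal(size=(n, n)), Y))  # much larger for a random W
```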
A brief view of Bottleneck Autoencoders
Neural networks can learn nonlinear relationships between data and labels, and a bottleneck autoencoder can be thought of as a more powerful, nonlinear generalisation of PCA.
Generative Modelling
Introduction via Latent Variables