COMP41021 Representation Learning
Introduction
Data are vectors in High Dimensions
High-quality images are made up of millions of pixels, each of which is a set of real numbers.
Models such as GPTs do predictive analysis over text by splitting sentences into character clusters/tokens, each of which is represented/embedded as a vector of real numbers.
Much of modern ML needs to take a real-valued function f as input; the function is sent to the model in an encoded form, such as a vector.
Often the coordinates of these data are not entirely independent and one can infer relationships among them, leading to low dimensional representations of the data.
Let Mathematics Guide Your High Dimensional Intuition
Diagrams can skew what we believe to be true, so don't reason about machine learning from high-dimensional diagrams; they lead to the wrong intuition.
8 Foundational Concepts
(8) Random Variables
Data is uncertain
As real-world data is high dimensional, it brings geometric complexities, and ML algorithms need to work with uncertainty about what exactly the data will be. We model this data using the language of "random variables", and we need formal ways of encoding them.
We focus on random variables which take values that can be written as vectors in n-real coordinates, and assume that they are describable by a pdf.
Most common examples of pdfs
Uniform distribution, p(x) = 1/4 for all x in [-2, 2] and 0 elsewhere, making these values of x equally likely.
Gaussian distribution, p(x, y), allows data to be any vector in the domain R^2, but vectors become exponentially less likely the farther they are from the origin.
(7) Risk Function
Assuming we have the basic notation to define supervised learning, we can also assume a choice of loss function which measures the error of the predictor, e.g. ℓ(w, x, y) = ½(y − ⟨w, x⟩)², the squared error.
We also assume that there exists a distribution D - with distribution function p - on X × Y from which the data (x, y) is being sampled. That leads to the notion of "expected loss" or "population risk", R(w) = E_{(x,y)~D}[ℓ(w, x, y)].
As we don't know the data distribution p and we only have samples from it, we have to work with the "empirical risk", R̂(w) = (1/m) Σ_{i=1}^{m} ℓ(w, x_i, y_i) over the m available samples.
The challenge of machine learning is to be able to minimise the population risk while only having access to empirical risk.
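As a concrete illustration (not from the notes), here is a minimal numpy sketch of the empirical risk of a linear predictor under the squared loss; the synthetic dataset and its noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = <w_true, x> + noise  (hypothetical generative model).
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(1000, 3))           # m = 1000 samples in R^3
y = X @ w_true + 0.1 * rng.normal(size=1000)

def empirical_risk(w, X, y):
    """Empirical risk R_hat(w) = (1/m) * sum_i 0.5 * (y_i - <w, x_i>)^2."""
    residuals = y - X @ w
    return 0.5 * np.mean(residuals ** 2)

print(empirical_risk(w_true, X, y))       # close to 0.5 * noise variance
print(empirical_risk(np.zeros(3), X, y))  # much larger for a bad predictor
```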
(6) Introduction to the Euclidean Norm
Definition - 2-Norm
Given a vector v ∈ R^p we define its 2-norm as ||v||_2 = sqrt(Σ_{i=1}^{p} v_i²), i.e. the Euclidean distance of the point v from the origin.
(5) Introduction to Convexity
One of the common classes of functions that we shall encounter as examples of "loss" functions are those which are differentiable and "convex".
Definition - Convexity
An (at least once) differentiable function F: R^p -> R is convex if ∀ x, y we have F(y) ≥ F(x) + ⟨∇F(x), y − x⟩, where ∇F is the derivative of F, which is assumed to be well-defined.
Note that for differentiable functions, convexity can be thought of as the property that for any point x in the domain, the tangent to the function at that point is below the graph of the function
(4) Basics of Constrained Convex Optimisation
The key challenge of ML is to solve function minimisation questions under partial information, so we think about possible algorithms of function minimisation.
For now we will restrict ourselves to a preliminary result about minimising a differentiable convex function under full-rank linear constraints.
Most Basic Question of Convex Optimisation
For f: R^n -> R, a differentiable convex function, find min f(x) subject to Ax = b, where A ∈ R^(p×n) has rank p ≤ n (rows are linearly independent), so that this minimum exists.
x* ∈ R^n solves the above question iff there exists a vector λ* ∈ R^p such that ∇f(x*) + A^T λ* = 0 and Ax* = b.
(You are only required to know how to use this result, not prove it)
Note the following interpretation: one could have defined the convex function L(x, λ) = f(x) + λ^T(Ax − b), called the Lagrangian, where the λ vector would be called the "Lagrange multipliers". Then it is easy to see that the previously stated necessary and sufficient conditions for the optimal point are equivalent to solving for x and λ such that ∇_x L(x, λ) = 0 and ∇_λ L(x, λ) = 0.
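As an illustrative sketch (not from the notes), for the quadratic f(x) = ½||x||² the conditions ∇f(x*) + A^T λ* = 0 and Ax* = b form a linear system that can be solved directly; the particular A and b below are made-up examples:

```python
import numpy as np

# Minimise f(x) = 0.5 * ||x||^2 subject to A x = b, so grad f(x) = x.
# Optimality conditions:  x* + A^T lam* = 0  and  A x* = b.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])    # full row rank: p = 2, n = 3 (made up)
b = np.array([1.0, 2.0])

n, p = A.shape[1], A.shape[0]
# Assemble the block system  [[I, A^T], [A, 0]] [x; lam] = [0; b].
KKT = np.block([[np.eye(n), A.T],
                [A, np.zeros((p, p))]])
rhs = np.concatenate([np.zeros(n), b])
sol = np.linalg.solve(KKT, rhs)
x_star, lam_star = sol[:n], sol[n:]

print(x_star)                      # the constrained minimiser
print(A @ x_star - b)              # ~0: constraint satisfied
print(x_star + A.T @ lam_star)     # ~0: stationarity of the Lagrangian
```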
(3) Examples of Non-Convex Neural Losses
Restricting to convex functions would not suffice to capture the part of the course which uses neural nets, so here is an example of how easily neural loss functions can become non-convex.
Consider training data (x1, y1) = (0.5, −100), (x2, y2) = (−1, 300), (x3, y3) = (1, 1), (x4, y4) = (−0.5, −400). Recall that one of the simplest neural nets consists of a single sigmoid gate of weight w, which maps x ↦ σ(wx) = 1/(1 + e^(−wx)).
A natural instance of a regularised squared loss function on the above gate for this training data would be of the form F(w) = λw² + (1/4) Σ_{i=1}^{4} (y_i − σ(w x_i))².
Notice how we think of losses as being univariate functions of the weight(s)
The plot of this F, for λ = 0.13, as a function of w shows that it isn't convex: it has a local maximum, and two local minima, only one of which is a global minimum.
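A small numpy sketch that scans this loss on a grid of w values and reports its local minima; the exact normalisation of F is an assumption, but the qualitative picture (more than one local minimum) matches the plot described above:

```python
import numpy as np

# Grid-scan of the (assumed) regularised loss
# F(w) = lam * w^2 + 0.25 * sum_i (y_i - sigmoid(w * x_i))^2.
x = np.array([0.5, -1.0, 1.0, -0.5])
y = np.array([-100.0, 300.0, 1.0, -400.0])
lam = 0.13

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def F(w):
    return lam * w**2 + 0.25 * np.sum((y - sigmoid(w * x))**2)

ws = np.linspace(-40.0, 40.0, 4001)
vals = np.array([F(w) for w in ws])
# Interior grid points lower than both neighbours are (approximate) local minima.
local_min = [(round(ws[i], 2), round(vals[i], 1)) for i in range(1, len(ws) - 1)
             if vals[i] < vals[i - 1] and vals[i] < vals[i + 1]]
print(local_min)   # more than one local minimum => F is not convex
```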
(2) Introduction to Spectral Norms of Matrices
Definition - 2,2-Norm
For any matrix A ∈ R^(m×n) we define its 2,2-norm as ||A||_{2,2} = sup_{x ≠ 0} ||Ax||_2 / ||x||_2 (sup is supremum, here a max). When it is clear from the context, the above will also just be called the 2-norm of A, denoted ||A||_2, or the spectral norm of A.
Definition - Eigenvector
A vector v ∈ R^n, v ≠ 0, is an eigenvector of a square matrix A ∈ R^(n×n) with eigenvalue λ if Av = λv.
An example of this considers the family of matrices A(θ) = [[0, θ], [0, 0]]. One can solve the equation Av = λv to obtain the eigenvalues and eigenvectors for this matrix and realise that its only eigenvalue is 0.
For an arbitrary vector on the unit circle in R^2 - parameterised as, say, (cos t, sin t)^T - we have A(θ)(cos t, sin t)^T = (θ sin t, 0)^T. Thus ||A(θ)||_2 = sup_t |θ sin t| = |θ|. So by choosing θ arbitrarily large in magnitude, we can make the spectral norm of A(θ) as large as we want, while the largest eigenvalue magnitude (the "spectral radius" of the matrix) remains 0 for all θ.
In general, we can show spectral radius ≤ spectral norm, but the gap between them can be arbitrarily large.
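A quick numpy check of this gap, using the family A(θ) = [[0, θ], [0, 0]] from the example above:

```python
import numpy as np

def spectral_norm(A):
    # 2-norm of a matrix = its largest singular value.
    return np.linalg.norm(A, 2)

def spectral_radius(A):
    # Largest eigenvalue magnitude.
    return np.max(np.abs(np.linalg.eigvals(A)))

for theta in [1.0, 10.0, 1000.0]:
    A = np.array([[0.0, theta],
                  [0.0, 0.0]])
    print(theta, spectral_norm(A), spectral_radius(A))
    # The spectral norm grows like |theta| while the spectral radius stays 0.
```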
(1) Singular Value Decomposition (SVD)
The ideas of vector spaces and eigenvalues are useful in factorisations called "Singular Value Decompositions", which exist for all matrices.
Towards understanding SVD we define that an n×n matrix O is said to be orthogonal if OO^T = O^T O = I_{n×n}.
Next, we realise that if we have an m×n matrix X then XX^T and X^T X have only non-negative eigenvalues.
If w ≠ 0 is an eigenvector of XX^T with eigenvalue λ, then (XX^T)w = λw -> w^T(XX^T)w = ||X^T w||² = λ||w||² -> λ ≥ 0 - and similarly for X^T X.
Lastly, note that if XX^T w = λw and λ > 0, then X^T X (X^T w) = λ (X^T w), i.e. X^T w is an eigenvector of X^T X with the same eigenvalue λ. One could have made a similar argument starting from X^T X. Hence we know that XX^T and X^T X have the same set of positive eigenvalues.
Now we have all the tools to define SVD: any X ∈ R^(m×n) can be written as X = U Σ V^T, where U ∈ R^(m×m) and V ∈ R^(n×n) are orthogonal (the columns of U are eigenvectors of XX^T and the columns of V are eigenvectors of X^T X) and Σ ∈ R^(m×n) is diagonal with non-negative entries σ_i = sqrt(λ_i), the singular values.
Note that in general some of the columns of U and V can correspond to 0 eigenvectors of XX^T and X^T X respectively.
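A minimal numpy sketch checking these facts on a random matrix (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))                # arbitrary 4x6 matrix

U, s, Vt = np.linalg.svd(X)                # X = U @ Sigma @ Vt (full SVD)
Sigma = np.zeros((4, 6))
Sigma[:4, :4] = np.diag(s)

print(np.allclose(X, U @ Sigma @ Vt))      # True: reconstruction holds
print(np.allclose(U @ U.T, np.eye(4)))     # True: U is orthogonal
# The positive eigenvalues of X X^T equal the squared singular values.
eig = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]
print(np.allclose(eig, s**2))
```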
Introduction to Gradient Descent
What is gradient descent?
Gradient descent on a differentiable function F: R^p -> R is the iteration x_{t+1} = x_t − η_t ∇F(x_t), for step-lengths η_t > 0.
The key question - largely unresolved
"If F has global minima, then when does G.D. find one of them, which one, and how fast?"
Introduction to Proof of Convergence of Gradient Descent
Consider the case of a simple objective function F(x) = x^2
We can see the G.D. steps are x_{t+1} = x_t − η_t F'(x_t) = x_t − 2η_t x_t = (1 − 2η_t)x_t.
Let's unroll x_{t+1} = (1 − 2η_t)x_t to get x_{t+1} = x_0 ∏_{i=0}^{t} (1 − 2η_i).
Note that if we have x_0 = 0, then for all times t we would have x_t = 0, and the algorithm would not move at all. More generally, if the algorithm were to start at a point x_0 with F'(x_0) = 0 (a "critical point" of F), it would not move at all, so critical points are not to be used as starting points.
Proof of convergence of gradient descent on x^2
We want x_t -> 0 (the global minimum of F) as t -> ∞
From the recursion above it follows that a simple sufficient condition for that to happen is if for some constant k ∈ (0, 1), (1 − 2η_i) = k ∀i - and this can be ensured by choosing a constant step-length, η_t = ½(1 − k).
We then have (1 − 2η_i) = k ∀i, and we can rewrite what we previously had for x_{t+1} as x_{t+1} = k x_t = k^{t+1} x_0.
Thus we have |x_t| = k^t |x_0| -> 0, showing that for a range of choices of constant step-length, gradient descent converges on x^2 to its global minimum regardless of x_0.
But we can get more precise. To obtain convergence-time ("non-asymptotic") bounds, we check how long it takes to get within an ε > 0 interval of the global minimum: |x_T| ≤ ε once T ≥ log(|x_0|/ε) / log(1/k).
Thus we have arrived at the key fact that GD on this simple function can converge in a very short time - the number of steps needed grows only logarithmically in 1/ε.
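A minimal numpy sketch of this argument, using F(x) = x², a constant step-length η = (1 − k)/2, and an arbitrary starting point:

```python
import numpy as np

# Gradient descent on F(x) = x^2, with F'(x) = 2x.
k = 0.5                  # contraction factor in (0, 1)
eta = 0.5 * (1.0 - k)    # constant step-length, so (1 - 2*eta) = k
x = 10.0                 # arbitrary starting point x_0
eps = 1e-6

t = 0
while abs(x) > eps:
    x = x - eta * 2.0 * x      # x_{t+1} = (1 - 2*eta) * x_t = k * x_t
    t += 1

print(t, x)
# Predicted bound: T >= log(|x_0| / eps) / log(1 / k)
print(np.log(10.0 / eps) / np.log(1.0 / k))
```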
Principal Component Analysis
Background
An approach to exploratory data analysis, especially for high-dimensional data, and one of the earliest used in pattern recognition and ML. Widely used for data visualisation, data compression and feature extraction.
Solves the problem of finding linear orthonormal projections to maximise data variance in a new space.
Intuitively, finds a new "coordinate system" of maximum variance; the axes are named principal components (PCs).
Principle
Recap
A vector space is represented by a set of basis vectors u1, u2, ..., ud such that ∀ a ∈ R^d, a = a1 u1 + a2 u2 + ... + ad ud.
For any linear model used for representation learning, find a linear transformation (projection matrix) that carries out a pre-specified learning goal.
For dimension reduction, find p (p < d) basis vectors to form the projection matrix of the linear transformation.
Data Centralisation
For a dataset X = (X_ij)_{d×N} = {x1, x2, ..., xN}, we can get its mean vector m = (1/N) Σ_{n=1}^{N} x_n.
Data centralisation is done by x_n <- x_n − m, n = 1, 2, ..., N.
Principle: PCA Derivation
Find the 1st principal component
Given a dataset of N data points in a d-dimensional space
Centralise the data as above
To find the 1st PC of maximum variance, u, we need to establish a utility function encoding this learning goal, by associating the data with the parameter vector.
Let z_n denote the orthogonal projection of the centralised data onto the unit-length parameter vector u in the new vector space, i.e. z_n = (x_n − m)^T u.
The mean of the projected data is zero due to data centralisation.
On the u basis (axis), we estimate the variance of z with mean 0: Var(z) = (1/N) Σ_{n=1}^{N} z_n².
Inserting z_n = (x_n − m)^T u into this equation leads to Var(z) = u^T S u, where S = (1/N) Σ_{n=1}^{N} (x_n − m)(x_n − m)^T is the empirical covariance matrix of X.
Thus, the learning objective to find the 1st PC of maximum variance is formed: max_u u^T S u s.t. u^T u = 1.
Apply the Lagrange multiplier to form the unconstrained utility function: L(u, λ) = u^T S u + λ(1 − u^T u).
To find the optimal u, apply the optimality condition with respect to u: ∂L/∂u = 0 gives S u = λ u. Thus we know u must be an eigenvector of S with eigenvalue λ.
Apply our findings regarding u and λ to the variance of z: Var(z) = u^T S u = λ u^T u = λ. To maximise Var(z), we must choose the u* with the largest eigenvalue λ*.
The choice of u* leads to the 1st principal component.
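A minimal numpy sketch of this conclusion - the 1st PC is the top eigenvector of the empirical covariance matrix; the synthetic data and the 1/N covariance convention are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data with most variance along the direction (1, 1).
X = rng.normal(size=(2, 500)) * np.array([[3.0], [0.3]])
X = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2) @ X   # rotate by 45 degrees

m = X.mean(axis=1, keepdims=True)           # mean vector
Xc = X - m                                  # centralised data
S = (Xc @ Xc.T) / X.shape[1]                # empirical covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)        # ascending eigenvalues
u1 = eigvecs[:, -1]                         # 1st PC = top eigenvector
print(u1)                                   # close to +/- (1, 1)/sqrt(2)
print(eigvals[-1])                          # variance captured, close to 9
```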
PCA Formulation via Maximum Variance
We can find more PCs with the same idea, e.g. the 2nd PC learning objective is: max_u u^T S u s.t. u^T u = 1 and u^T u1 = 0.
In general, p (p < d) PCs u1, u2, ..., up are achieved by doing the eigen-analysis on the covariance matrix S and sorting the top p eigenvalues; the PCs are their corresponding eigenvectors.
Property: all p PCs are orthonormal, i.e. u_i^T u_j = δ(i, j) = 1 if i = j and 0 otherwise, and they act as a set of p axes to form a new "coordinate" system.
Therefore, we can use the PCA projection matrix P_{p×d} = {u1, u2, ..., up} to achieve a p-dimensional representation z_n = [z_n1, z_n2, ..., z_np]^T for a d-dimensional data point x_n: z_n = P(x_n − m), or z_ni = u_i^T (x_n − m), i = 1, 2, ..., p, where m is the mean of x.
PCA Formulation via Optimal Reconstruction
Motivation: from the p-dimensional representation z_n of a d-dimensional data point x_n in the PCA space (p < d), we can reconstruct the point in the original d-dimensional space (with some reconstruction error e_n): x̂_n = m + Σ_{i=1}^{p} z_ni u_i.
The total quadratic reconstruction error is measured by E = Σ_{n=1}^{N} ||x_n − x̂_n||², the sum of squared lengths of the blue lines shown in the original figure. Thus, we can formulate PCA via minimising the total reconstruction error.
From this perspective, we formulate the PCA learning objective: min_{W, Z} Σ_{n=1}^{N} ||x_n − m − W z_n||² s.t. W^T W = I, where Z_{p×N} = {z1, z2, ..., zN} is a collective notation for the low-dimensional representations of all data points in X, and W_{d×p} = {w1, w2, ..., wp} indicates the p basis vectors.
Minimising with respect to Z and W under the constraints leads to the same solution as the maximum-variance formulation: the optimal W consists of the top p eigenvectors of S, with z_n = W^T(x_n − m).
Property: on the training dataset X, the total reconstruction error E is equal to the sum of the eigenvalues of the unused PCs, E = Σ_{i=p+1}^{d} λ_i. Here u_i is the ith PC, i.e. the ith eigenvector based on the ranked eigenvalues of S.
Principle: Dual PCA
Problem: in reality, there are often more features than instances in X_{d×N} (d > N), which leads to a computational infeasibility problem in the eigen-analysis for PCA.
After data centralisation on X, its covariance matrix is S_{d×d} = (1/N) X X^T, a singular matrix (rank(S) ≤ N); i.e. there are at most N non-zero eigenvalues for S, but they are very hard to obtain due to the massive computation needed for a large d.
Its inner-product or Gram matrix is S'_{N×N} = X^T X, a non-singular matrix (rank(S') = N); its N non-zero eigenvalues are achievable much more efficiently from S'_{N×N}.
It can be proven that the dual matrices, S and S', share the same non-zero eigenvalues.
Solution: an eigenvector of S is achieved via the corresponding one of S': u_i ∝ X v_i (normalised to unit length), where u_i and v_i are the corresponding eigenvectors of S and S', and λ_i is their shared eigenvalue.
Principle: SVD Solution
Recap
Singular value decomposition section of 8 Foundational concepts
Solution to PCA
After data centralisation on X, its covariance matrix and inner-product matrix are S = (1/N) X X^T and S' = X^T X.
Writing the SVD of the centralised data matrix as X = U Σ V^T, we have X X^T = U (Σ Σ^T) U^T and X^T X = V (Σ^T Σ) V^T.
Therefore, we can utilise either U or V to produce the PCs: the columns of U are the eigenvectors of X X^T (so u_i is the ith column of U), or equivalently u_i ∝ X v_i with v_i the ith column of V.
Fact: SVD provides a solution to dual PCA directly from the centralised X instead of S or S'
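A short numpy sketch of this fact, assuming a synthetic d > N dataset; it checks that the top left singular vector of the centralised X gives the same direction as the top eigenvector of S:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 50, 10                       # more features than instances (d > N)
X = rng.normal(size=(d, N))
Xc = X - X.mean(axis=1, keepdims=True)       # centralised data

# Route 1: eigen-analysis of the d x d covariance matrix (expensive for large d).
S = (Xc @ Xc.T) / N
vals, vecs = np.linalg.eigh(S)
u_eig = vecs[:, -1]                          # top eigenvector of S

# Route 2: SVD of the centralised data matrix directly.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
u_svd = U[:, 0]                              # top left singular vector

# Same direction up to sign; the shared eigenvalue is s[0]^2 / N.
print(np.allclose(np.abs(u_eig @ u_svd), 1.0))
print(np.isclose(vals[-1], s[0]**2 / N))
```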
Algorithm: Basic PCA
Training phase: Find out PCA Projection matrix
Data centralisation
For a given dataset X_{d×N} (d < N), calculate the mean vector m = (1/N) Σ_{n=1}^{N} x_n and make a "mean matrix" M_{d×N} = {m, m, ..., m}. Then perform X <- X − M.
Eigen analysis
Calculate the covariance matrix S = (1/N) X X^T (on the centralised X). Find and rank all d eigenvalues so that λ1 ≥ λ2 ≥ ... ≥ λd, with their corresponding eigenvectors u1, u2, ..., ud.
Construct projection matrix
Select the top p (p < d) eigenvectors of S to be the PCs and form the projection matrix P_{d×p} = {u1, u2, ..., up}.
Deployment Phase: Encoding vs Decoding
Encoding: generate a low-dimensional representation of a data point: z = P^T(x − m); z is a p-dimensional representation of x.
Decoding: reconstruct the data point from its low-dimensional representation: x̂ = P z + m, an approximate d-dimensional version of x.
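A compact numpy sketch of the whole basic PCA algorithm (training, encoding, decoding) on synthetic data; the 1/N covariance convention and the toy data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, p = 5, 200, 2
X = rng.normal(size=(d, N)) * np.array([[5.0], [3.0], [1.0], [0.2], [0.1]])

# Training phase
m = X.mean(axis=1, keepdims=True)                 # mean vector, d x 1
Xc = X - m                                        # data centralisation
S = (Xc @ Xc.T) / N                               # covariance matrix
vals, vecs = np.linalg.eigh(S)                    # ascending eigenvalues
P = vecs[:, ::-1][:, :p]                          # top-p eigenvectors, d x p

# Deployment phase
Z = P.T @ (X - m)                                 # encoding: p x N
X_hat = P @ Z + m                                 # decoding: d x N reconstruction

# Average squared reconstruction error ~ sum of the discarded eigenvalues.
err = np.mean(np.sum((X - X_hat) ** 2, axis=0))
print(err, np.sum(vals[:-p]))
```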
Algorithm: Dual PCA
Training Phase: Find out PCA projection matrix
Data centralisation, as before.
SVD solution to eigen analysis
To use the V matrix, calculate Y_{N×d} = X^T (the transposed centralised data). With the SVD Y = U Σ V^T, V_{d×d} consists of the eigenvectors of Y^T Y = X X^T, i.e. of the (unnormalised) covariance matrix.
Construct projection matrix
Select top p (p<=N) eigenvectors of S to be PCs to form the projection matrix like before.
Deployment Phase
also the same as before
Algorithm: Proportion of Variance
How do we find a proper dimension, p*, to form the PCA space?
Define the proportion of variance PoV(k) = (Σ_{i=1}^{k} λ_i) / (Σ_{i=1}^{d} λ_i) for k < d.
In practice, find the smallest k* so that PoV(k*) ≥ 90% and set p* = k*.
PoV(k) also works for the dual PCA on X_{d×N} (d ≥ N), although there are only up to N non-zero eigenvalues and at least d − N zero eigenvalues. In this case, only check k ≤ N to find p*.
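A tiny sketch of the PoV rule, reusing eigenvalues of a covariance matrix; the example eigenvalues are made up and the 90% threshold is the one stated above:

```python
import numpy as np

def choose_dimension(eigenvalues, threshold=0.90):
    """Return the smallest k with PoV(k) >= threshold.

    eigenvalues: 1-D array sorted in descending order.
    """
    lam = np.asarray(eigenvalues, dtype=float)
    pov = np.cumsum(lam) / np.sum(lam)
    return int(np.argmax(pov >= threshold)) + 1   # +1: k is 1-indexed

# Example with made-up eigenvalues.
lam = np.array([25.0, 9.0, 4.0, 1.0, 0.5, 0.5])
print(choose_dimension(lam))   # PoV(2) = 85%, PoV(3) = 95% -> p* = 3
```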
Applications
PCA Application to visualisation of microarray data
PCA application to data compression for reconstructing numbers
Eigenface for data compression and feature extraction of facial images
Limitations of original PCA
Non-orthogonality
Data distributions may not satisfy the PCA assumption of orthogonal dimensions. We can use independent component analysis instead, which captures the maximum variance in non-orthogonal dimensions.
Inconsistency with discrimination
In classification, the axis of maximum variance may be inconsistent with the axis of largest discrimination. Linear discriminant analysis finds the axis of largest discrimination by considering the label information.
Non-linearity
Data may not be distributed in a linear subspace, while PCA can only cope with a linear manifold. Extended PCA techniques and manifold learning can deal with nonlinear manifolds for low-dimensional representations.
Extension of PCA
Logistic PCA: extension to allow dealing with binary and categorical input features.
Sparse PCA: overcomes the weakness that PCs are usually a linear combination of all input features, by using just a few variables.
Non-linear PCAs: e.g. principal curve analysis, to deal with non-linear manifolds.
Robust PCAs: overcome the weakness that outliers cause large errors, e.g. weighted PCA.
Probabilistic PCA: a stochastic variant (generative model) that explains the PCA process from a data-generation perspective.
Canonical Correlation Analysis
Background
An approach to exploratory data analysis, especially for two relevant random vectors, e.g. the visual and audio channels in video.
Widely used for feature extraction and data interpretation.
Goal: find optimal linear projections for two relevant random vectors to maximise their correlation in the new spaces.
Provides new low-dimensional representations of the two random vectors, achieved by projecting data points into the new CCA spaces.
Principle: CCA Derivation
Find out the 1st pair of canonical vectors, a and b
Given a dataset of N data points (X, Y), where X_{p×N} = {x1, x2, ..., xN} and Y_{q×N} = {y1, y2, ..., yN} are centralised, respectively.
To find the 1st pair of canonical vectors, a and b, of maximum correlation, we need to establish a utility function by encoding this learning goal.
Projecting X and Y onto a and b leads to z_X = X^T a and z_Y = Y^T b.
The correlation between the two vectors z_X and z_Y is correlation(z_X, z_Y) = (a^T S_XY b) / sqrt((a^T S_XX a)(b^T S_YY b)), where S_XX, S_YY and S_XY are covariance matrices on X, Y and (X, Y), i.e. S_XX = (1/N) X X^T, S_YY = (1/N) Y Y^T and S_XY = (1/N) X Y^T.
The denominators in correlation(z_X, z_Y) play the role of normalisation, making the correlation invariant with respect to rescaling of a and b.
Thus, the learning objective to find the 1st pair of a and b for maximum correlation is formulated along with two normalisation constraints: max_{a,b} a^T S_XY b s.t. a^T S_XX a = 1 and b^T S_YY b = 1.
To find the optimal a and b, apply the optimality conditions (from the Lagrangian of this constrained problem): S_XY b = λ_a S_XX a and S_YX a = λ_b S_YY b.
Multiplying a^T and b^T on both sides of the equations and applying the two constraints, we have that a^T S_XY b = λ_a and b^T S_YX a = λ_b are scalars and equal to each other. Therefore, we must have λ_a = λ_b = λ, the canonical correlation coefficient.
By using the common λ, we rewrite the above equations as S_XY b = λ S_XX a and S_YX a = λ S_YY b.
Assuming S_XX and S_YY are invertible, we rewrite the last equation as b = (1/λ) S_YY^{-1} S_YX a and insert it into the first equation: S_XX^{-1} S_XY S_YY^{-1} S_YX a = λ² a.
With eigen-analysis on the last equation, choose the a* with the largest λ*², and set b* = (1/λ*) S_YY^{-1} S_YX a*.
The choice of a* and b* leads to the 1st pair of canonical vectors, a and b.
CCA Formulation
We can find more pairs of cannonical vectors with the same idea, e.g. the learning objectives of the 2nd pair of
a
and
b
as follows:
In general, M (M <= min(p,q)) pairs of canonical vectors, A = {
a
1, ...,
a
M} and B = {
b
1, ...,
b
M} are achieved by doing the eigen analysis on the composite matrix
, to form A
(pxM) with the eigenvectors
a
1, ...,
a
M of top M eigenvalues and B
(qxM) with
Property
: all M canonical vectors in A and B are
uncorrelational
, i.e.
if i=j and 0 otherwise.
For dataset (X, Y), we can use CCA projection matrices A
(pxM) and B
(qxM) to achieve the
M-dimensional representations
:
Algorithm: Basic CCA
Training Phase: Find out CCA projection matrices
Data centralisation
Like before, but with a mean matrix for both X and Y.
Eigen analysis
Calculate the covariance matrices S_XX, S_YY and S_XY (with S_YX = S_XY^T). For the composite matrix S_XX^{-1} S_XY S_YY^{-1} S_YX, find and rank all p eigenvalues so that λ1² ≥ λ2² ≥ ... ≥ λp², with their corresponding eigenvectors a1, ..., ap. Then, generate the paired eigenvectors via b_i = (1/λ_i) S_YY^{-1} S_YX a_i.
Construct projection matrices
Select the top M eigenvectors to form the paired projection matrices A_{p×M} = {a1, ..., aM} and B_{q×M} = {b1, ..., bM}.
Deployment Phase: generate low-dimensional representations in CCA space
z_X = A^T(x − m_X) and z_Y = B^T(y − m_Y).
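A small numpy sketch of this procedure on synthetic paired data; the exact covariance normalisation, the tiny ridge added for invertibility, and the data are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, q = 500, 4, 3
shared = rng.normal(size=(1, N))                  # hidden common signal
X = np.vstack([shared, rng.normal(size=(p - 1, N))])
Y = np.vstack([2.0 * shared, rng.normal(size=(q - 1, N))])

# Data centralisation
X = X - X.mean(axis=1, keepdims=True)
Y = Y - Y.mean(axis=1, keepdims=True)

# Covariance matrices (with a tiny ridge for invertibility)
Sxx = X @ X.T / N + 1e-8 * np.eye(p)
Syy = Y @ Y.T / N + 1e-8 * np.eye(q)
Sxy = X @ Y.T / N

# Eigen-analysis of the composite matrix Sxx^-1 Sxy Syy^-1 Syx
C = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
vals, vecs = np.linalg.eig(C)
order = np.argsort(-vals.real)
lam = np.sqrt(vals.real[order[0]])                # 1st canonical correlation
a = vecs.real[:, order[0]]
b = np.linalg.solve(Syy, Sxy.T @ a) / lam

zx, zy = a @ X, b @ Y
print(lam, np.corrcoef(zx, zy)[0, 1])             # both close to each other
```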
Algorithm: SVD-Based CCA
Training Phase: Find out CCA projection matrices
Data centralisation
As before, for both X and Y.
SVD solution to eigen analysis
Calculate the covariance matrices S_XX, S_YY and S_XY. Apply SVD to the whitened cross-covariance matrix K = S_XX^{-1/2} S_XY S_YY^{-1/2} = U Σ V^T, where U_{p×p} and V_{q×q} are orthogonal and the diagonal of Σ holds the canonical correlations.
Construct projection matrices
Select the top M singular vectors to form the paired projection matrices A_{p×M} and B_{q×M}, where A = S_XX^{-1/2} U_M and B = S_YY^{-1/2} V_M.
Deployment Phase: Generate low-dimensional representations in CCA space
As before
Algorithm: Proportion of Variance
As before except
In practice, find the smallest k so that PoV(k) ≥ 90%, and then set M* = k.
In CCA, PoV(k) works over the up to min(p, q) non-zero eigenvalues, which are obtained in squared form (λ²) by the basic or the SVD-based CCA algorithms.
Manifold Learning
Introduction
Manifold: a research area in the mathematics of topology and differential geometry. A d-dimensional space is said to be a manifold iff at each point there exists a neighbourhood that is homeomorphic to d-dimensional Euclidean space, R^d.
Manifold hypothesis: real-world high-dimensional data often lie on low-dimensional (sub)manifolds embedded in the high-dimensional space.
Manifold learning: discover and model low-dimensional manifolds by learning from data, to form a low-dimensional latent embedding representation.
Rigid versus non-rigid geometry
Rigid shapes are invariant in Euclidean space, while inelastic non-rigid shapes are variant in Euclidean (extrinsic) space.
Extrinsic Euclidean distance does not work for non-rigid shapes.
There is an intrinsic space for inelastic non-rigid shapes (a non-linear manifold) where the intrinsic distance is invariant for any data points on the non-linear manifold.
Manifold learning maps the non-rigid shapes to a latent embedding space where extrinsic distance works, preserving the intrinsic distance.
Illustrative examples
Visual perception - images with the same properties are located in low-dimensional manifolds.
This needs to be captured in computer vision - many intrinsic properties underlie manifolds in computer vision. It accounts for appearance variation and shape deformation, and also allows missing frames on the manifold to be interpolated.
Multi-dimensional Scaling (MDS)
Background
A collection of dimension reduction techniques that map the distances between observations in a high-dimensional source space into a low-dimensional target space.
Learn a configuration of points in the target space whose inter-point distances preserve the distances between the corresponding points in the source space.
Models the intrinsic manifold for low-dimensional representation and visualisation.
Problem Formulation
Given a distance matrix in the d-dimensional source space, ∆ = (δ_ij)_{N×N}, we have δ_ii = 0, δ_ij > 0 and δ_ij = δ_ji. (MDS doesn't need to know the dimensionality or the distance metric of the source space.)
Given a p-dimensional (Euclidean) target space (p < d), the distance between two data points z_i and z_j is measured by d_ij = ||z_i − z_j||_2.
Problem: find N data points Z_{p×N} = {z1, ..., zN} in the p-dimensional space such that d_ij ≈ f(δ_ij), where f(·) is a parametric monotonic increasing/decreasing function, e.g. f(δ_ij) = α + βδ_ij.
To ensure the solution is unique, impose the constraint Σ_{i=1}^{N} z_i = 0 (the mean point must be zero).
Most MDS learning algorithms choose f(δ_ij) = δ_ij, so that d_ij ≈ δ_ij.
Algorithm: Classical MDS (cMDS)
In the d-dimensional (Euclidean) source space, the distance matrix comes from an unknown centralised data matrix of N points, X_{d×N}, where δ²_ij = ||x_i − x_j||² = (x_i − x_j)^T(x_i − x_j).
For a p-dimensional (Euclidean) target space (p < d), the distance between two data points z_i and z_j is measured by d²_ij = ||z_i − z_j||² = (z_i − z_j)^T(z_i − z_j).
Problem: find N data points Z_{p×N} in the p-dimensional space to minimise the disparity between the d_ij and the δ_ij.
A solution is achieved by means of Euclidean vector space properties.
For a centralised data matrix X, its inner-product (Gram) matrix G_{N×N} = X^T X is expressible via its distance matrix: G = −½ H ∆² H, where H = I_N − (1/N) e e^T, ∆² = (δ²_ij)_{N×N}, I_N is the identity matrix and e = [1 1 ... 1]^T.
In element-wise notation, G_ij = −½ (δ²_ij − (1/N) Σ_k δ²_ik − (1/N) Σ_k δ²_kj + (1/N²) Σ_{k,l} δ²_kl).
By means of the Euclidean vector space property, we also do the same in the target space: Z^T Z = −½ H D² H, where D² = (d²_ij)_{N×N}.
Thus, the disparity loss can be rewritten as the minimisation of a reconstruction error: min_Z ||Z^T Z − G||²_F.
Solution: conduct spectral decomposition of G = X^T X: G_{N×N} = V_{N×N} Σ_{N×N} V^T_{N×N} = Σ_{i=1}^{N} λ_i v_i v_i^T, where v_i is the ith eigenvector of G corresponding to the eigenvalue λ_i, and λ1 ≥ λ2 ≥ ... ≥ λN.
Produce the p-dimensional optimal coordinates (p < d) by setting Z = Σ_p^{1/2} V_p^T, where Σ_p is the diagonal matrix of the top p eigenvalues and V_p is the matrix of the corresponding N-dimensional eigenvectors, v1, ..., vp. In element-wise notation: z_ij = sqrt(λ_i) (v_i)_j, i = 1, ..., p, j = 1, ..., N.
Connection to PCA: when X is known, projecting the N points onto the top p PCs obtained from its covariance matrix leads to the same embedding results (configuration) as cMDS. PoV is also applicable to cMDS for choosing a proper p.
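A compact numpy sketch of cMDS from a distance matrix, using the double-centring formula above; the toy data are an assumption:

```python
import numpy as np

def classical_mds(Delta, p):
    """Classical MDS sketch: embed an N x N distance matrix into R^p."""
    N = Delta.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N          # centring matrix
    G = -0.5 * H @ (Delta ** 2) @ H              # Gram matrix from distances
    vals, vecs = np.linalg.eigh(G)               # ascending eigenvalues
    vals, vecs = vals[::-1][:p], vecs[:, ::-1][:, :p]
    return (vecs * np.sqrt(np.maximum(vals, 0.0))).T   # p x N coordinates

# Check on points whose pairwise Euclidean distances we know.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 6))                      # 6 points in R^2
Delta = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
Z = classical_mds(Delta, p=2)
D = np.linalg.norm(Z[:, :, None] - Z[:, None, :], axis=0)
print(np.allclose(D, Delta))                     # distances are preserved
```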
Algorithm: Stress-based MDS
In MDS, many different stress (loss) functions have been proposed to carry out d_ij ≈ f(δ_ij).
Commonly used stress (loss) functions, all with respect to Z_{p×N}: L_ee, L_ff and L_ef.
L_ee: penalises large absolute errors.
L_ff: penalises large relative errors.
L_ef (aka Sammon mapping when f(δ_ij) = δ_ij): a trade-off between L_ee and L_ff.
Solution: apply an optimisation method to minimise the loss function w.r.t. Z.
There is no analytic solution, but an iterative method, e.g. stochastic gradient descent (SGD), can be used:
1. Initialise the p-dimensional embedded coordinates randomly: z^0_1, ..., z^0_N.
2. Update each coordinate by a gradient step on the chosen stress, z_k <- z_k − η ∂L/∂z_k, where the gradient involves the current distances d_kj = ||z_k − z_j||.
3. Repeat step 2 until a stopping condition is satisfied.
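A minimal numpy sketch of this iterative scheme for the simplest case, the raw stress Σ_{i<j} (d_ij − δ_ij)² with f(δ) = δ; the step-size, iteration count, toy data and the use of full-batch gradient steps are assumptions:

```python
import numpy as np

def stress_mds(Delta, p=2, lr=0.01, iters=2000, seed=0):
    """Gradient-descent sketch for the raw stress sum_{i<j} (d_ij - delta_ij)^2."""
    rng = np.random.default_rng(seed)
    N = Delta.shape[0]
    Z = rng.normal(size=(p, N))                  # random initial coordinates
    for _ in range(iters):
        diff = Z[:, :, None] - Z[:, None, :]     # p x N x N pairwise differences
        D = np.linalg.norm(diff, axis=0) + np.eye(N)   # avoid divide-by-zero on diagonal
        # dL/dz_k = sum_j 2 * (d_kj - delta_kj) * (z_k - z_j) / d_kj
        coef = 2.0 * (D - Delta) / D
        np.fill_diagonal(coef, 0.0)
        grad = (coef[None, :, :] * diff).sum(axis=2)
        Z -= lr * grad
    return Z

# Toy check: recover a configuration for 5 points from their distance matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 5))
Delta = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
Z = stress_mds(Delta)
D = np.linalg.norm(Z[:, :, None] - Z[:, None, :], axis=0)
print(np.abs(D - Delta).max())        # residual disparity after optimisation
```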
Extension
Non-metric MDS (nMDS): dissimilarities are known only by their rank order, e.g. ordered alphabetically.
Weighted MDS: stress-based metric MDS is extensible to highlight certain aspects of the dissimilarities in the data.
Symbolic MDS: MDS is extensible to symbolic data, using intervals of dissimilarities and distributions of dissimilarity instead of distances between two points.
Interactive MDS: an extension of MDS that allows a new, intuitive interaction with the MDS user on the interplay of specific dissimilarities in the data.
Isometric Feature Mapping (ISOMAP)
Background
Motivation: tackle a fundamental problem in manifold learning, i.e. given high-dimensional data sampled from an unknown low-dimensional manifold, how can we automatically recover a good embedding?
Linear models, e.g. PCA, can only recover a linear manifold; they do not work for non-linear manifolds because Euclidean distance metrics do not respect the geometry of non-linear manifolds.
Solution: find a proper distance metric that fits the geometry of the nonlinear manifold. The homeomorphic property of a manifold allows an approximation of the proper geodesic distance via the use of Euclidean distance locally.
Principle
Problem: how to find a transformation that preserves the geodesic distances between high-dimensional data points in a low-dimensional Euclidean space?
Key idea: ISOMAP tackles this problem by
Finding an approximation to the geodesic distances between any points in high-dimensional space, to establish a geodesic distance matrix
Applying cMDS to the geodesic distance matrix to achieve embedded coordinates in a low-dimensional space where the Euclidean distances between points are close to their corresponding geodesic distances.
Approximation to geodesic distance for neighbouring data points
Determine neighbours based on the Euclidean distances between points in the dataset (ϵ-neighbourhood or K-NN).
With the neighbourhood information, construct a weighted graph in which edges (with weights) exist only between a point and its neighbouring points.
Approximation to geodesic distance for distant data points
Approximate the geodesic distances between all pairs of non-connected points (those without edges) by estimating their shortest-path distances in the weighted graph.
Finding the shortest-path distance in a weighted graph is a classical optimisation problem in graph theory (e.g. Dijkstra's algorithm).
Apply cMDS for low-dimensional embedding
Based on the approximated geodesic distances between the data points in the dataset, form a geodesic distance matrix in the high-dimensional source space.
Choose a proper dimension for the low-dimensional target space, and apply cMDS to the geodesic distance matrix to obtain the embedded coordinates of the data points in low-dimensional Euclidean space.
Algorithm
Construct neighbourhood graph
Based on the Euclidean distance matrix in the source space, ∆_X = (δ_X(i, j))_{N×N}, set up the graph G by connecting points i and j if δ_X(i, j) ≤ ϵ (ϵ-ISOMAP) or if point i is one of the K nearest neighbours of point j (K-ISOMAP). Set edge lengths equal to δ_X(i, j).
Compute shortest paths
Initialise δ_G(i, j) = δ_X(i, j) if i and j are linked by an edge, or δ_G(i, j) = ∞ otherwise.
For k = 1, ..., N, replace all entries δ_G(i, j) by min{δ_G(i, j), δ_G(i, k) + δ_G(k, j)}.
Form the geodesic distance matrix in the source space: ∆_G = (δ_G(i, j))_{N×N}.
Construct p-dimensional embedded coordinates with cMDS
Convert the geodesic distance matrix into its corresponding Gram matrix, G (via double centring, as in cMDS).
Conduct spectral decomposition: G_{N×N} = V_{N×N} Σ_{N×N} V^T_{N×N} = Σ^N_{i=1} λ_i v_i v_i^T.
Produce the p-dimensional optimal coordinates (p < d) by setting Z = Σ_p^{1/2} V_p^T, where Σ_p is the diagonal matrix of the top p eigenvalues and V_p is the matrix of the corresponding N-dimensional eigenvectors, v_1, ..., v_p.
In element-wise notation: z_ij = sqrt(λ_i) (v_i)_j, i = 1, ..., p; j = 1, ..., N.
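A self-contained numpy sketch of these three steps (K-NN graph, shortest paths via the min-plus update, then cMDS); the toy helix dataset and the hyperparameters are assumptions:

```python
import numpy as np

def isomap(X, n_neighbors=6, p=1):
    """K-ISOMAP sketch: K-NN graph, shortest-path geodesics, classical MDS."""
    N = X.shape[1]
    delta = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)

    # Step 1: neighbourhood graph (K-NN, symmetrised), edge lengths = Euclidean.
    geo = np.full((N, N), np.inf)
    np.fill_diagonal(geo, 0.0)
    for i in range(N):
        nbrs = np.argsort(delta[i])[1:n_neighbors + 1]
        geo[i, nbrs] = delta[i, nbrs]
        geo[nbrs, i] = delta[i, nbrs]

    # Step 2: shortest-path (geodesic) distances via the min-plus update.
    for k in range(N):
        geo = np.minimum(geo, geo[:, [k]] + geo[[k], :])

    # Step 3: classical MDS on the geodesic distance matrix.
    H = np.eye(N) - np.ones((N, N)) / N
    G = -0.5 * H @ (geo ** 2) @ H
    vals, vecs = np.linalg.eigh(G)
    vals, vecs = vals[::-1][:p], vecs[:, ::-1][:, :p]
    return (vecs * np.sqrt(np.maximum(vals, 0.0))).T     # p x N coordinates

# Toy check: a 1-D helix in R^3; the 1-D embedding should track the parameter t.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 4 * np.pi, 200))
X = np.vstack([np.cos(t), np.sin(t), 0.3 * t])
Z = isomap(X, n_neighbors=6, p=1)
print(np.abs(np.corrcoef(Z[0], t)[0, 1]))   # close to 1
```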
Relevant Issue
Limitation
To work, sufficient data points have to be sampled smoothly from the manifold.
Works only on convex Euclidean manifolds to recover the intrinsic geometry.
Out-of-sample extension
ISOMAP does not provide any mapping function z = f(x).
For extension to unseen data, we can use the known raw data and their embedded coordinates as training examples to learn a parametric mapping function z = f(Θ, x).
Extension
Conformal ISOMAP: relaxes the manifold assumption by preserving manifold orientation instead of geodesic distance.
C-Isomap: allows magnifying regions of high density while shrinking regions of low density.
Incremental ISOMAP: allows online ISOMAP learning by embedding points one by one instead of training in a batch manner.
Landmark ISOMAP: overcomes the high computational burden of learning by using landmarks, only a subset of representative data points.
Robust ISOMAP: replaces Dijkstra-path-based geodesic distance estimates with a parallel transport unfolding approximation, for robustness to noise.
Locally Linear Embedding
Background
Motivation: tackle the same fundamental problem that ISOMAP encounters in manifold learning, i.e. given high-dimensional data sampled from an unknown low-dimensional manifold, how to recover a good embedding automatically?
To model the intrinsic nonlinear manifold for low-dimensional representation and visualisation.
Solution: find a proper distance metric that fits the geometry of the nonlinear manifold. This is difficult for an arbitrary non-linear manifold. The homeomorphic property of a manifold allows an approximation of the proper geodesic distance via the use of Euclidean distance locally.
LLE adopts a key idea, think globally, fit locally, to learn an embedding that preserves geodesic distance.
Principle
Problem: how to find a transformation that preserves the geodesic distances between high-dimensional data points in a low-dimensional Euclidean space?
Key idea: unlike ISOMAP, LLE tackles this problem by two-stage learning.
Stage 1: learn to reconstruct each data point from its neighbours via a linear parametric model; the learned parameters encode the local geometric information in the high-dimensional data space, and such information can be propagated via the neighbourhood connections of different points to approximate geodesic distances globally.
Stage 2: learn the embedding coordinates by reconstructing each data point from its neighbouring points linearly in the low-dimensional representation space, with the parameters learned in Stage 1, so that the nonlinear manifold is embedded properly.
Main steps in LLE
Select neighbours (K-NN or ϵ-neighbourhood)
Stage-1 learning (reconstruct with linear weights)
Given X_{d×N} = {x1, ..., xN}, learn the optimal linear weights W*_{N×N} = {w*_1, ..., w*_N}.
Loss function: E(W) = Σ_{n=1}^{N} ||x_n − Σ_j W_nj x_j||², where W_nj = 0 if point j is not a neighbour of point n. For any x ∈ X with neighbours x_1, ..., x_K, the per-point loss is ||x − Σ_{j=1}^{K} w_j x_j||².
Thus the learning objective is to minimise the constrained loss: min_w ||x − Σ_{j=1}^{K} w_j x_j||² s.t. Σ_{j=1}^{K} w_j = 1.
Convert the learning objective into an unconstrained loss with a Lagrange multiplier: L(w, λ; X) = ||x − Σ_{j=1}^{K} w_j x_j||² + λ(1 − Σ_{j=1}^{K} w_j).
Optimisation: minimise L(w, λ; X) to find the optimal parameters w* = [w*_1, ..., w*_K]^T, giving w*_j = Σ_k C^{-1}_jk / Σ_{l,m} C^{-1}_lm, where C^{-1} = (C^{-1}_jk)_{K×K} is the inverse of the local Gram matrix C_jk = (x − x_j)^T(x − x_k).
Stage-2 learning (map to embedded coordinates)
With the optimal linear weights W*_{N×N}, learn a mapping to obtain the optimal embedded coordinates Z*_{p×N} (p < d).
Loss function: E(Z) = Σ_{n=1}^{N} ||z_n − Σ_j W*_nj z_j||².
To facilitate optimisation, we can reformulate the loss function in a quadratic form: E(Z) = tr(Z M Z^T), where it can be shown that M = (I_N − W*)^T(I_N − W*), which also allows the matrix M to be stored efficiently.
Thus, the learning objective is to minimise the constrained loss: min_Z tr(Z M Z^T) s.t. (1/N) Z Z^T = I_p and Σ_n z_n = 0.
Optimisation: this is a standard form of quadratic programming problem. The solution requires an eigen-analysis of the matrix M.
Construct p-dimensional embedding coordinates
Let λ1 ≤ λ2 ≤ ... ≤ λN be the eigenvalues of M and v1, ..., vN be their corresponding eigenvectors.
Produce the p-dimensional embedded coordinates by setting z*_i = v_{i+1}, i.e. z*_ij = (v_{i+1})_j.
Fact: the optimal embedding needs the bottom p + 1 eigenvectors of the matrix M, discarding the bottom eigenvector with λ_1 = 0 and v_1 = [1 1 ... 1]^T.
Algorithm
Select neighbours for each data point
Determine neighbours using the Euclidean distances between data points in the high-dimensional space, via ϵ-neighbourhood or K-NN.
Stage-1 learning: reconstruct with linear weights
For each data point x ∈ X, use only its neighbours x_1, ..., x_K to compute C = (C_jk)_{K×K}, where C_jk = (x − x_j)^T(x − x_k), and its inverse C^{-1} = (C^{-1}_jk)_{K×K}.
Compute the optimal Lagrange multiplier: λ* = 2 / Σ_{l,m} C^{-1}_lm.
Compute the optimal linear weights: w*_j = Σ_k C^{-1}_jk / Σ_{l,m} C^{-1}_lm.
Stage-2 learning: map to embedded coordinates
Conduct eigen-analysis on M = (I − W*)^T(I − W*) and rank the eigenvalues λ1 ≤ ... ≤ λN, with their corresponding eigenvectors v1, ..., vN.
Produce the p-dimensional optimal embedded coordinates by setting z*_i = v_{i+1}, i.e. z*_ij = (v_{i+1})_j, i = 1, ..., p; j = 1, ..., N.
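A self-contained numpy sketch of both stages (K-NN neighbours, closed-form weights with a small regulariser when K > d, then the bottom eigenvectors of M); the toy helix dataset and hyperparameters are assumptions:

```python
import numpy as np

def lle(X, n_neighbors=8, p=2, reg=1e-3):
    """Locally Linear Embedding sketch (K-NN neighbours, closed-form weights)."""
    d, N = X.shape
    dist = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    W = np.zeros((N, N))

    # Stage 1: reconstruct each point from its neighbours with linear weights.
    for n in range(N):
        nbrs = np.argsort(dist[n])[1:n_neighbors + 1]
        diff = X[:, [n]] - X[:, nbrs]                 # d x K
        C = diff.T @ diff                             # local Gram matrix, K x K
        C += reg * np.trace(C) * np.eye(n_neighbors)  # regularise if K > d
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[n, nbrs] = w / w.sum()                      # enforce sum-to-one

    # Stage 2: embedding from the bottom eigenvectors of M = (I - W)^T (I - W).
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)                    # ascending eigenvalues
    return vecs[:, 1:p + 1].T                         # discard the constant eigenvector

# Toy check on a 1-D helix in R^3: the embedding should track the parameter t.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 4 * np.pi, 300))
X = np.vstack([np.cos(t), np.sin(t), 0.3 * t])
Z = lle(X, n_neighbors=8, p=1)
print(np.abs(np.corrcoef(Z[0], t)[0, 1]))             # reasonably close to 1
```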
Relevant Issue
Limitations
To work, data points have to be sampled uniformly from the manifold.
Sensitive to noisy sampling and to the hyperparameter K (or ϵ).
Unlike ISOMAP
LLE does not make any assumption on the manifold, but may not be able to recover complex non-convex manifolds.
There is no robust way to decide the intrinsic dimension of the embedding space.
Out-of-sample extension
Like ISOMAP (and MDS), LLE does not provide any mapping function.
For extension to unseen data, however, we can use the raw data and their embedded coordinates as training examples to learn a parametric mapping function, e.g. a support vector regressor or a DNN.
Extension
Hessian LLE: replaces the linear reconstruction with Hessian operators at each data point in Stage-1 learning, for robustness.
Modified LLE: uses multiple weight vectors in each neighbourhood to remedy distorted LLE maps.
Diffusion Map: replaces linear reconstruction with the Laplacian of a graph of local similarities at different scales, for a global description of the data.
Local Tangent Space Alignment: inspired by LLE, carries out the intuition that when a manifold is correctly unfolded, all of the tangent hyperplanes to the manifold become aligned.
Basics of Neural Nets
Activation Gates of the Neural Network
The building blocks of a "neural net" are its activation gates, which do the basic analogue computations.
The gate shown in the original figure is an R^3 -> R neural gate evaluating the "activation" function f: R -> R on a linear transformation of the input vector, i.e. (x1, x2, x3) ↦ f(w1 x1 + w2 x2 + w3 x3). The w's are the "weights".
The ReLU activation function
For most purposes the best activation function to use is the "Rectified Linear Unit (ReLU)", ReLU(x) = max{0, x}.
And one affine layer (say A) of the net corresponds to the map x ↦ Wx + b, for some matrix W and a vector b.
Example of a One Layer ReLU Net
Including a ReLU activation, one defines a layer of a neural net as x ↦ max{0, Wx + b}. Working through this, the max{0, ·} acts on each value in the vector (coordinate-wise).
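A tiny numpy sketch of one such layer; the particular weights and input are made up:

```python
import numpy as np

def relu_layer(W, b, x):
    """One neural-net layer: coordinate-wise max{0, W x + b}."""
    return np.maximum(0.0, W @ x + b)

W = np.array([[1.0, -2.0, 0.5],
              [0.0,  1.0, 1.0]])    # made-up 2x3 weight matrix
b = np.array([-0.5, -1.0])
x = np.array([1.0, 1.0, 2.0])

print(relu_layer(W, b, x))          # negative coordinates are clipped to 0
```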
Defining Neural Nets
The key component of a neural net is the activation function, with the most widely used being ReLU.
What is a neural net?
Given a set of k+1 affine transformations A_1, ..., A_{k+1}, it defines a depth-(k+1) "ReLU Deep Neural Net" as the function x ↦ A_{k+1}(ReLU(A_k(... ReLU(A_1(x)) ...))), i.e. affine maps interleaved with coordinate-wise ReLUs.
Basic Setup of Neural Net Objectives
For a data distribution D, a finite set of samples S from it, a loss function L, and a neural net N, we define its Population Risk, R, and Empirical Risk, R̂, as R(N) = E_{(x,y)~D}[L(N(x), y)] and R̂(N) = (1/|S|) Σ_{(x,y)∈S} L(N(x), y).
We view the above two as real-valued functions which compute the different risks as N varies across the different functions realisable on the architecture (as the weights, w, change). Then we can write R(N) as R(w).
For the most basic kind of depth-2 ReLU nets, consider the population risk given by the standard squared loss implemented on the neural-net mapping, R(w) = E_{(x,y)~D}[ ||y − N_w(x)||²_2 ].
There are special cases of the above expectation where it can be exactly evaluated. Neural nets whose input and output are of the same dimension are called autoencoders.
Intro to Autoencoders
Sparse-coding: an instructive setup with Autoencoders
The defining equation of our autoencoder, computing the reconstruction ŷ from the input y:
The generative model: a sparse x* ∈ R^h and y = A x* ∈ R^n, with h > n.
Define the code h := ReLU(W y − ε) = max{0, W y − ε} ∈ R^h, for encoder weights W ∈ R^(h×n) and a bias vector ε.
Then the autoencoder's output is ŷ = W^T h = W^T ReLU(W y − ε), as we try to see how well we can reconstruct the input.
And a typical loss function to be minimised would be E_y ||y − W^T ReLU(W y − ε)||²_2 plus a regularisation term, where the second half is included so we don't bias the algorithm towards the low-norm part of the search space.
Sparse-coding by Autoencoders is an example of Generative Modelling
The typical autoencoder loss function can be re-written as an expectation over the generative model, i.e. over y = A x*.
Notice that in the above, W^T is of the same dimensions as A. Now if, under certain special conditions, we can minimise the above loss function and it turns out that the minimising W^T is very close to A - then we would have approximately figured out how the inputs y to the autoencoder have been generated from the simpler sparse vectors / "source codes" x*.
Hence, this is a simple example of generative modelling
Building more intuition about the autoencoding loss function
One special case where there is a known correct global minimum: we observe that the standard squared loss, ||y − W^T ReLU(W y)||²_2, evaluates to its minimum possible value of 0 at W^T = A in the special case where y = Ax such that A is orthogonal and x ≥ 0.
But even in this case, we do not currently know how to prove that an optimisation algorithm will actually recover A from given samples of y, which is all that the loss function has access to.
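A short numpy sketch of this special case, with an assumed random orthogonal A and non-negative sparse codes; it only verifies that the loss is (near) zero at W = A^T, not that any algorithm finds that point:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# Random orthogonal dictionary A (square n x n, matching the special case above).
A, _ = np.linalg.qr(rng.normal(size=(n, n)))

def sample_y(batch):
    """Generative model: sparse non-negative codes x, observations y = A x."""
    x = np.maximum(0.0, rng.normal(size=(n, batch)))   # non-negative
    x *= rng.random(size=(n, batch)) < 0.3             # sparse support
    return A @ x

def loss(W, Y):
    """Empirical version of E || y - W^T ReLU(W y) ||^2."""
    recon = W.T @ np.maximum(0.0, W @ Y)
    return np.mean(np.sum((Y - recon) ** 2, axis=0))

Y = sample_y(1000)
print(loss(A.T, Y))                      # ~0 at the "planted" solution W = A^T
print(loss(rng.normal(size=(n, n)), Y))  # much larger for a random W
```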
A brief view of Bottleneck Autoencoders
Neural networks can learn nonlinear relationships between data and labels, and a bottleneck autoencoder can be thought of as a more powerful, nonlinear generalisation of PCA.
Generative Modelling
Introduction via Latent Variables