Please enable JavaScript.
Coggle requires JavaScript to display documents.
COMP41011 Foundations of Machine Learning - Coggle Diagram
COMP41011 Foundations of Machine Learning
Introduction to Machine Learning
Machine Learning or Statistical Learning
We would like to design an algorithm that helps us to solve different prediction problems, based on a mathematical model/function and database.
Examples of ML problems
Handwritten digit recognition
Face detection and recognition
Age prediction
Stock market prediction
Clustering and segmentaton of customers
Recommendation systems
Definitions
Feature extraction
: transforming an image into a vector
X
Instance
: the pair (
X
, y), where y is the image label
Objective
: find a function f(
x
,
w
)
Training set
: A set of N images and their labels to fit the predictive model.
Estimation/Training phase
: the process of getting the values of
w
of the function f(
x
,
w
) that best fits the data
Generalisation
: ability to correctly predict the label of new images
x
Overtraining
: learning the data instead of the patterns
Underfitting
: model is too simple to capture the complexity of the data
Supervised learning
Variable y is discrete: classification
Variable y is continuous: regression
Unsupervised learning
We only have access to instances, not the lables so we:
Find similar groups (clustering)
Find a probability function for
X
: density estimation
Find a lower dimensionality representation for
X
(dimensionality reduction and visualisation)
Main Challenges of Machine Learning
Insufficient quality of training data
Non-representative training data
Poor-quality data
Irrelevant features
Overfitting the training data
Underfitting the training data
An E2E ML Project
Machine Learning Project Checklist
Frame the problem and look at the big picture
Get the data
Explore the data to get insights
Prepare the data to better expose the underlying data patterns
Explore many different models and shortlist the best ones
Fine-tune your models and combine them into a solution
Present your solution
Frame the problem and look at the big picture
Set the problem in context
How will the solution be used?
What are the current solutions (if any)?
Is it a regression or classification problem?
How is performance measured?
What would be the minimum performance?
Are there comprable problems?
Is human expertise available?
Model performance assessment
We use several metrics to asses the performance of a model
Metrics in regression
Root mean squared errror (RMSE)
This is the preferred performance measure for regression tasks
Mean absolute error (MAE)
Used when there are many outlier districts
Metrics in classification
Precision
The ratio of correct positive predictions to the overall number of positive predictions
= TP / (TP + FP)
or the fraction of relevant documents in the documents returned
Accuracy
the ratio of examples correctly classified over the total number of examples classified
= (TP + TN) / (TP + TN + FP + FN)
useful when errors predicting all classes are equally important, but can be misleading in imbalanced classification problems.
Recall
The ratio of correct positive predictions to the overall number of positive examples in the dataset
= TP / (TP + FN)
or the fraction of relevant documents returned to the total relevant documents
Confusion matrix
TP -
True Positive
, FP -
False Positive
, TN -
True Negative
, FN -
False Negative
A table that summarises how successful the classification model is at predicting examples belonging to various classes
Get the data
Be aware of:
What data do you need and is it enough?
Legal obligations regarding the data
Removing sensitive information
Check the data size and type
Sample a test set and do not look
You can get data from many data repositories and datasets
Three sets
Training
Largest set
Used to fit the model using the objective function
For a large dataset use 70-95% for training
For smaller dataset use as many training instances as possible and use cross-validation to make better use of this and the validation set
Validation
Same size as test
Used to choose the best predictive model among a set of candidates
For a large dataset use 15-2.5% for validation
Test
Same size as validation
Used to asses the final performance of the model before shipping the model to production.
For a large dataset use 15-2.5% for testing
k-fold cross validation
OR Leave-one-out cross validation
Explore the data
Take into account
Create a copy of the data for exploration or sample a fraction of a large dataset for exploration
Study each feature and its characteristics including
name, noisiness, type, percentage of missing values, type of distribution, is the feature useful?
Identify the target attribute (for supervised learning)
Visualise the data
Study correlations between attributes
Identify what transformations you might be able to apply
Is there any additional data that you might find useful?
Study correlations between attributes
The correlation coefficient between two RVs X and Y is given as
which is between -1 and 1 and σx,y is the covariance between X and Y
Prepare the data
Data cleaning
Remove outliers (optional)
Handle missing values
Fill them in using the mean, median or other value
Drop the feature if most of the instances have a missing value
Drop an instance if you have several instances with missing values
Add an additional feature describing whether the instance is missing the value
Most ML methods require features that are numbers rather than categories, so we use one-hot encoding to obtain a higher dimensional binary representation for each value, as when values don't have a natural order they should not be mapped in order.
Feature selection and feature engineering
Remove features that are uninformative
Discretize continuous features (binning)
Create new features from available ones
Feature scaling
Several ML methods do not perform well when input features have very different scales, so we try to get all features to the same scale
Normalisation (min-max scaling)
Map the range of values that a feature takes to the range [-1, 1] or [0, 1]
Standardisation (z-score normalisation)
The features are scaled so that they have mean 0 and standard devation 1.
Which scaling to use?
If the dataset is small and there is time, try both and see which performs better. Else:
Unsupervised learning algorithms usually benefit more from standardisation
Standardisation also works when the feature has already a distribution close to Gaussian
If a feature has outliers, standardisation is also preferred
Normalisation is preferred in all other cases
Shortlist promising models
If the data is big, use a smaller subset of the data to try different models in a reasonable time
Try different models from several categories (linear, non-linear, non/probabilistic)
Measure and compare the performances of the models, using the same data in the training and validation sets
Shortlist two to three models that look promising
Fine-tune the system
Use as much data as you can at this stage
Fine tune the hyperparameters of your model using cross-validation
Use random or grid search to explore hyperparameters
Consider the data transformations
Use Auto-ML to fine-tune
Once your are confident about your final model, merge the training and validation sets and fit the model again
Finally use the test data to assess the generalisation error of your ML system (do not change your model as you might overfit to the test data)
Present your solution
Document what you have done
Create a nice presentation
Explain why your solution achieves the business objective
Bring up interesting points you noticed
Review of Probability
Definitions
Random Variable
(RV): a function that assigns a number to the outcome of a random experiment
Discete RV
: can take a value only from a countable number of distinct values
Continuous RV
: can take any value from infinite possible values within one or more intervals
Discrete Random Variables
Probability mass function
A discrete RV X is completely defined by a set of values it can take x1, x2, ..., xn and their corresponding probabilities
The probability that X = xi is denoted as P(X = xi) for i = 1, ..., n and is known as the probability mass function (pmf)
Two discrete RVs
In machine learning, we are usually interested in more than one random variable and they can be fully described with a joint probability mass function P(X=xi, Y=yj)
Rules of probability
Marginal
We obtain the probability of P(X=xi) regardless of the value of Y. This is known as the
sum rule of probability
Conditional
From this we can also write P(X = xi, Y = yj) = P(X = xi | Y = yj)P(Y = yj). This is known as the
product rule of probability
We compute P(X = xi) from data by approximating the probability of that outcome by the number of instances / the number of trials. The approximation becomes an equality as the number fo trials nears infinity.
Statistical independence
Two discrete RVs are statistically independent if P(X = xi , Y = yj) = P(X = xi)P( Y = yj), i = 1, . . . , n j = 1, . . . , m
Bayes' Theorem
Expected value and statistical moments
The expected values or statistical moments of the discrete RV X, used frequently are the mean and the variance.
The squared root of the variance is known as the standard deviation.
Continuous Random Variables
Probability density function
A continuous RV X takes values within one or more intervals of the real line
We use pdfs, pX(x), to describe a continuous RV X.
Two continuous RVs
Additionally we are often interested in more than one continuous RV, so we use a joint pdf, pX,Y(x, y), to fully characterise two continuous random variables.
Rules of probability
Sum rule of probability
where px(x) is known as the marginal pdf
Product rule of probability
which can also be written as pX,Y(x, y) = pX|Y(x|y)pY(y)
Bayes theorem
Statistical independence
We say that two continuous RVs are statistically independent if pX,Y(x, y)= pX(x)pY(y)
Expected values and statistical moments
For continuous RVs, expected values are computed as
Review of vector/matrix notation and linear algebra
Definitions
scalar
numeric value
vector
ordered list of scalar values
matrix
rectangular array of scalars arranged into rows and columns
matrix transpose
matrix multiplication
if
A
has dimensions pxq and
B
has dimensions txs then matrix multiplication os the form
AB
is only possible if q = t
if so then
Products
Transpose of a product
The transpose of the product
Xw
, (
Xw
)T =
w
T
X
T
Inner product
between two vectors results in a scalar
Outer product
between two vectors results in a matrix
We transform scalar operations to matrix vector operations because they are handled efficiently by low level routines already included in modules like numpy, instead of using loops for scalar operations.
Vector/matrix differentiation
Other types of matrices
Identity matrix
A square matrix with 1s on the main diagonal (where i = j) and 0s elsewhere
Inverse matrix of a matrix
A
of dimensions dxd, denoted as
A
^-1, satisfies
AA
^-1 =
A
^-1
A
=
1d
Maximum Likelihood: Iterative solution
Coordinate Descent
We find the update equation for each of the parameters by differentiating the error function and rearranging to find the specific parameter, and finding the second derivative to ensure that the point is a minimum
We are trying to minimise the error function, by updating the parameters to their optimal values, moving one at a time.
We start by using the maximum likelihood update to find an estimate for the first parameter
The update equations are alternated in a loop until the get a good fit and meet the
stopping criterion
, in this case, the difference between the previous error and the current is < 1e-4
Multiple Input Solution with Linear Algebra
As coordinate descent can be slow, we can use linear algebra to get straight to the minimum.
The
predictive function
we are trying to fit has the form f(x) = mx + c. From a linear algebra perspective, we are looking at multiplications and additions.
We also want to separate our parameters from out data (the
givens
) by making two vectors that we can create an inner product of to fit the function.
We do this by writing f(x) = m x c + c x 1. We then express the data and parameters as
and
and the inner product
x
^T
w
is equal to the predictive function.
Design Matrix
formed by combining each
x
_i into one vector
Objective Function
and
Any sum of squares can be represented by an inner product
b
^T
b
, so if we define all values in a vector
y
and every function f(
x
_i;
w
) in a vector
f(X;w)
or just
f
, the objective function can be written as
Aside:
is a
vector-valued function
where the elements are defined as fuction
If we incorporate the design matrix
X
, we get
f
=
Xw
, and with this result we have defined the model with the two equations, the one to tell us the form of the predictive function and the other, the form of the objective function
Objective Optimisation
Model
contains a function that shows how it will be used for prediction and a function that describes the objective fuction we need to optimise to obtain a good set of parameters
We find the minima of the objective function using multivariate calculus to find the stationary points.
We define vectorial derivatives as
where each part is known as the
partial derivative
of the error function with respect to w_1 and w_2
Matrix differentiation
Differentiation of an inner product
Differentiation of a 'matrix quadratic'
A scalar quadratic in z with coefficient c = cz^2. If
z
(kx1) and
c
(kxk) the cz^2 =
z
^T
Cz
which is a scalar, but a function of a vector
Matching dimensions in matrix multilications
AB
where
A
(kxp) and
B
(qxr) only works when p = q and the dimensions of
AB
would be kxr.
Matrix multiplication is not commutative unless the matrices are square and symmetric.
Differentiating the objective
Update Equation for Global Optimum
We seek stationary points by finding the vectors that solve for when the gradients are 0.
Solving the multivariate system
The solution for
w
is given in terms of a matrix inverse, but computation of a matrix inverse requires an algorithm. It is better to ask the computer to solve the system of equations given by
for
w
using numpy.linalg.solve
Solution with QR Decomposition
A
QR Decomposition
of a matrix factorises it into a matrix which is an orthogonal matrix
Q
, and a matrix which is upper triangular (a square matrix where all the entries below the main diagonal are zero),
R
.
This is a more numerically stable solutions beause it removes the need to compute
X
^T
X
as an intermediate and reducing numerical precision.
Basis functions
Multivariate linear regression allows us to build models that take many features into account when making our prediction, but if we only have one input value we can artificially generate more using basis function.
Non-linear in the input
When we refer to non-linear regression, we are normally referring to whether the reression is non-linear in the input space or
covariates
, the observations that move with the target variable e.g. the vector of covariates
x
_i associated with the ith observation corresponding to the target
y
_i
If a model is non-linear in the inputs, it means that there is a non-liner function between input and target space
An easy way to make the linear regression non-linear is to introduce non-linear functions, known as basis functions
Instead of working directly on the input space
x
, we build models in a new space Φ_1 = 1, Φ_2 = x, Φ_3 = x^2 and we can consider them together by placing them in a vector Φ(
x
) = [1 x x^2]^T
When we consider the vector valued function for each data point, we can place all the data into a matrix
where we are still in the 1D input setting so
x
represents a vector of inputs with n elements
The polynomial basis extends the quadratic basis to an arbitrary degree so we can defin the ith basis function associated with the model as
We often scale the input
x
to be in the range [-1, 1] as higher order polynomials are better behaved in this range
Non-linear, but linear in the parameters
Although this example is a non-linear regression, it is still a linear model because it is linear in the parameters, f(
x
) =
w
^TΦ(
x
)
In practice basis functions may contain its own set of parameters f(
x
) =
w
^TΦ(
x
;
θ
)
Linear regression
Linear model
A simple model for regression consists of using a linear combination of the attributes to predict the output f(
x
,
w
) = w0 + w1x1 + ... + wDxD, where
w
are the parameters of the regression model and w0 is the bias term/intercept. This can also be written f(
x
,
w
) =
w
^T
x
Gaussian pdf
Takes the form
and requires the mean and variance of the RV Y
It is denoted as p(y | µ, σ2 ) =
or
Gaussian regression model
We use a gaussian regression model to relate the inputs and outputs y_i = f
(x
_i,
w
) + ϵ where
It assumes that each output y_i can be explained as the prediction of an underlying model plus a noise term. For a fixed
x
and a fixed
w
, f(
x
,
w
) is a contant then y = constant + ϵ, where ϵ is a continuous RV.
What is the pdf for y?
E{y} = E{constant + ϵ} = constant
var{y} = var{constant} + var{ϵ} = σ^2
This means that
, and the given constant means
Because we assumed and
x
and
w
are given, we can also write
If we knew that value for
w
, once we have a new
x*
, we can predict the output as f(
x*
,
w
) =
w
^T
x
, where σ^2 tells us the noise variance.
How do we estimate
w
?
We start with a training dataset (
x
1, y1), ..., (
x
N, yN) and assume that the random variables y1, yN are independent and follow an identical (Gaussian in this case) distribution
Both assumptions go by the name of the iid assumption,
independent and identically distributed
Putting both assumptions together, we get
where
The expression above can we written as
When we look at a Gaussian pdf like above, we assume that both the mean and variance are given. In this case, the pdf follows all the properties that were reviewed before, and the same is true for
Given
w
^T
x
n and σ^2, then each p(yn |
x
n,
w
, σ^2) is a pdf, but with unknown values for
w
and σ^2, each p(yn |
x
n,
w
, σ^2) is not a pdf anymore.
In that case, the function
becomes a "likelihod function",
and we can use multivariate calculus to find the values of
w
, σ^2 that maximises h(
w
, σ^2).
In statistics, this is known as the maximum-likelihood (ML) criterion to estimate parameters.
Given
y
,
X
we use the ML criterion to find the parameters
w
and σ^2 that maximise
In practice, we prefer to maximise the log of the likelihood p(
y
|
X
,
w
, σ^2) or minimise the it's negative log likelihood.
Consistency of the ML criterion
If data was really generated according to the probability we specified, the correct parameters will be recovered in the limit as N → ∞
Connection with the sum of square errors
If we multiply LL(
w
, σ^2) by -1, we get
The ML criterion for this model has a close connection with the sum-of-squared errors used in non-proabilistic formulations of linear regression.
Maximising the log-likelihood function is equivalent to minimising the sum-of-squares error.
Notice that the log is a monotonic function, meaning that if we find
w
, σ^2 that minimise h(
w
, σ^2), those will also maximise log(h(
w
, σ^2))
Normal equation
Let us find as estimate for
w
. We know
This expression can be written in vectorial for as
Focusing on the term (
y
-
Xw
)^T(
y
-
Xw
), this equals
y
^T
y
–
w
^T
X
^T
y
–
y
^T
Xw
+
w
^T
X
^T
Xw
We can find the
w
that maximises LL(
w
, σ^2) by taking the gradient
, equating to zero and solving for
w
.
Taking the gradient of each term in LL(
w
, σ^2) wrt
w
we get
Putting these terms together we get
And equating to zero and solving for
w
, we get
The expression for
w*
is known as the
normal equation
and the solution exists if we can compute (
X
^T
X
)^-1. The inverse can be computed as long as
X
^T
X
is non-singular (i.e. has non-zero determinant, or full rank)
Solving for σ^2*
Following a similar procedure, it can be shown that the ML solution for σ^2* is given as
Basis functions
The model that is linear in
x
only allows linear relationships between
x
and y. We can exrend the model to describe non-linear relationships by using
basis functions
, non-linear mappings from inputs to outputs.
We can keep the linear relationship of y wrt
w
for tractablity
The predictive model follows as f(
x
,
w
)
where ϕ_i(x) are basis functions and we have M+1 parameters for the vector
w
and ϕ(
x
) = [1, ϕ1(
x
), …, ϕM(
x
)]^T
Examples
Transforming the input using the basis functions
Using the Olympic dataset example and using a polynomial basis function to predict y, the time
For each year, x, of the compitiion, we compute the vector of the polynomial basis functions
We have converted the unidimenional input feature into a higher dimensional feature representation ϕ(x) ∈ R^(M+1)
Normal equations with a design matrix
Given
X
, we first compute a new design matrix
Φ
We can now use (
y
,
Φ
) and write the Gaussian linear regression problem
where ϕ_n = ϕ(
x
_n)
Using the ML criterion, we arrive to the following normal equation
w*
= (
Φ
^T
Φ
)^(-1)
Φ
^T
y
Alternative to find
w
For solving the normal equation, we need to invert
X
^T
X
This inversion has a computational complexity between
to
The normal equation is linear regarding the number of instances in the training data, O(N)
It can handle a large trainnig set as long as it fits in memory
Alternatively, we can use iterative optimisation in cases with a large number of features and too many instances to fit in memory
Gradient Descent
General problem
We are given the function h(
w
) where
w
∈ R^p, and we aim to find a value fo
w
that minimises the function.
We use an iterative procedure
w
_(k+1) =
w
_k + η
d
_k, where
d
_k is known as the search direction such that h(
w
_k+1) < h(
w
_k)
The parameter η is known as the
step size
or
learning rate
Gradient descent
The simplest algorithm for unconstrained optimisation
Assumes that
d
_k = -
g
_k where
g
_k =
g
(
w
_k)
Also known as
steepest descent
It can be written like
w
_(k+1) = η
g
_k
Step size
The main issue in gradient descent is how to set the step size, as too small slows convergence and too large can mean a failure to converge
Alternatives to choose the step size η
Line search methods (may use search directions other than the steepest descent direction)
Conjugate gradient (method of choice for quadratic objectives g(
w
) =
w
^T
Aw
)
Use a Newton search direction
Gradient descent for linear regression
Let's assume that the objective function h(
w
) corresponds to the sum of squared errors
. We could also minimise the negative LL(
w
) instead NLL(
w
).
We write the update equation as
Computing the gradient for E(
w
), we get
The update equation follows as
w
_(k+1) =
w
_k - η
X
^T(
Xw
_k -
y
)
The computation of the gradient involves using the whole dataset (
X
,
y
) at every step. For this reason, this algorithm is known as batch gradient descent.
Gradient descent and feature scaling
Always normalise the features if using gradient descent
Gradient descent converges faster if all features have a similar scale
If the attributes are in very different scales, it may take a long time to converge.
Stochastic Gradient Descent
Online and large datasets
Traditionallt in machine learning, the gardient
g
_k is computed using the whole dataset
There are settings where only a subset of the data can be used
Online learning
: the instances (
x
_n, y_n) appear one at a time
Large datasets
: computing the exact value of
g
_k would be expensive, if not impossible
SGD
The gradient
g
_k is computed using a subset of the instances available
The word
stochastic
refers to the fact that the value for
g
_k will depend on the subset of the instances chosen for computation (and it is important to randomly sample)
In the stochastic setting, a better estimate can be found if the gradient is computed using
where
is the cardinality (number of samples in the set) of S, and
g
_(k, i) is the gradient at iteration k computed using the instance (
x
_i, y_i)
This setting is called mini-batch gradient descent
Step size
Choosing the value of η is particularly important in SGD since there is no easy way to compute it
Usually the value of η will depend on the iteration k, η_k
It should follow the
Robbins-Monro
conditions
Various formulas for η_k can be used
where τ0 slows down in early iterations and k ∈ (0.5, 1]
Regularisation
What is regularisation?
A technique used for preventing overfitting in a predictive model.
Consists of adding a term, a regulariser, to the objective function that encourages simpler solutions
With regularisation, the objective function for linear regression would be h(
w
) = E(
w
) + λR(
w
), where R(
w
) is the regularisation term and λ the regularisation parameter
In the expression for h(
w
), we can use NLL(
w
) instead of E(
w
)
Different types of regularisation
The objective function for linear regression would be h(
w
) = E(
w
) + λR(
w
), where R(
w
) follows as
where
and
If 0 < α < 1, we get the elastic net regularisation.
If α = 0, we get ℓ2 regularisation.
If α = 1, we get ℓ1 regularisation.
Ridge regression or ℓ2 regularisation
In ridge regression, α = 0,
It can be shown that an optimal solution for
w*
is given as
w*
= (
X
^T
X
+ λ
I
)^-1
X
^T
y
Notice that we can also use an iterative procedure for optimising h(
w
) either through batch gradient descent, SGD or mini-batch SGD
Logistic regression
Probabilistic classifier
A logistic regression model is an example of a probabilistic classifier
Let
x
∈ R^D represent a feature vector and y the target value
For a binary classification probelm we can use y ∈ {0, 1} or y ∈ {-1, 1}
We model the relationship between y and
w
using a Bernoulli distribution
Bernoulli distribution
A Bernoulli RV Y is an RV that can only take two possible value, e.g. tossing a coin
A Bernoulli distribution is a probability distribution for Y, expressed as
, where µ = P(Y = 1).
The expression can be summarised in one line using
How are y and
x
related in logistic regression?
The target feature y follows a Bernoulli distribution p(y|
x
) = Ber(y|µ(
x
)). The probability explicitly depends on
x
.
In logistic regression, the probability µ(
x
) is given as
where σ(z) known as the logistic sigmoid function.
We then have p(y|
w
,
x
) = Ber(y| σ(
w
^T
x
))
The logistic sigmoid function σ(z)
Recall
If z → ∞, σ(z) = 1. If z → −∞, σ(z) = 0. If z = 0, σ(z) = 0.5
The logistic sigmoid function σ(
w
^T
x
)
We have z =
w
^T
x
. For simplicity, assume
x
= x, then σ(wx)
Recall
So
Decision boundary
After the training phase, we will have estimator for
w
For a test input vector
x*
, we compute p(y=1|
w
,
x*
) = σ(
w
^T
x*
). This will give us a value between 0 and 1
We define a threshold of 0.5 to decide to which class we assign
x*
With this threshold we induce a linear decision boundary in the input space
Cross-entropy error function
Let
with
X
= [
x
1, ...,
x
N]^T and
y
= [y1, ..., yN]^T
Assuming IID observation
The cross-entropy function or negative log-likelihood is given as
which can be minimised with respect to
w
Gradient of NLL(
w
)
It can be shown that the gradient
g
(
w
) of NLL(
w
) is given as
where
σ
= [σ(
w
^T
x
1), ..., σ(
w
^T
x
N)]^T
Optimisation and regularisation
SGD methods described above can be applied to find the parameter vector
w
that minimises the negative log-likelihood NLL(
w
) in logistic regression
Likewise, all the regularisation techniques we saw before, can also be added to the cross-entropy error function.
Feedforward Neural Networks
Logistic Regression
A unit in a neural network computes a linear function of its input and is then composed with a non-linear
activation
function.
For LR, the non-linear activation is
sigmoid
which makes a linear separating surface
Multi-layer perceptron
Extends LR model by combining 3 logistic regressions and adds a hidden layer, so that there are three in total: input later, hidden layer and output layer
We introduce weigh coefficients, bias and then define the linear and non-linear transformation and the activation function
Solve using MLP
Notation
For each link, we have bias and weight terms.
Superscript always refers to the layer number
z is the linear output and a is the activation output
Vector and matrix notation:
There is a long example of derivations for the NLL and a toy example in the notes
Feedforward Neural networks
Many layers formed from MLPs, called feed forward because in each layer, the units are only connected to the next layer
We can easily extend MLP architecture to any finite number L of layers, in which layer l compuutes the function
where h_l is the non-linear activation in layer l.
Forward equations:
Backward pass to compute gradients
Back propagation: output layer
and given error E(
a
^L,
y
), we want
If there are n_L units in layer L,
in the N_L, n_L Jacobian matrix:
If h_L is applied element-wise, then this matrix is diagonal
Back propagation: hidden layer
Compute the gradient with respect to
z
^l
Gradients with respect to parameters
Knowing the transformation in layer l,
W
^l and
b
^l are parameters to be optimised, so we must compute the gradients with respect to them
Forward equations
Back-propagation equations
Computational questions
What is the running time to compute the gradients?
As many matrix multiplications as there are fully connected layers, performed twoce during forward and backward pass
What is the space requirement?
Need to store vectors
a
^l,
z
^l and
for each layer
Can we process multiple examples together?
Yes, if we minibatch, we perform tensor operations
Make sure that all parameters fit in GPU memory
Training Deep Neural Networks
Training Deep Neural Networks
SGD: a mini-batch of samples
A complete pass through the whole training set is known as a training epoch
Points at which the gradient vanished are called stationary points
A minimum that corresponds to the smallest value of the error function across the whole of w-space is a
global minimum
Any other minima corresponding to higher values of the error function is a
local minima
Parameter initialisation
Have a significant effect on convergence and the generalisation performance
Symmetry breaking: the same initialised values will lead to the same update of parameters
Consider either a uniform distribution [-ϵ, ϵ] or a zero-mean Gaussian N(0, ϵ^2)
The choice of ϵ is important. Consider a network layer:
With N(0, ϵ^2) initialisation and fixed variance of z_i^(l-1):
He initialisation: set
Normalisation
Removes the need to deal with extremely large or small values
Data normalisation: for
continuous inputs
, to rescale and recentre the input value
Batch normalisation
For
each nonlinear unit
a_i = h(z_i), to normalise the pre-activations within a mini-batch of size K:
The mean and variance are computed across the mini-batch separately for each hidden unit
Commonly used in convolutional neural networks
Layer normalisation
To normalise across the
hidden-unit values for each data point
separately
The mean and variane are computed across the hidden units separately for each data point
Commonly used in recurrent neural networks and transformers
Gradient momentum
Stochastic gradient descent (one data point at a time)
Can lead to oscillations for fixed step-sizes
A way to avoid this is to add inertia to the motion through weight space and smooth out the oscillations
where µ is called the momentum parameter.
Adaptive gradients: RMSProp and Adam
Optimal learning rate depends on the local curvature of error surface
AdaGrad
(Adaptive gradient): to reduce each learning rate by using the accumulated sum of squared gradients
RMSProp
(Root mean squared propagation): moving average
Adam
(Adaptive moments): moving average for both
Regularisation: weight decay
Regularisation in linear models:
Weight decay in machine learning: encourages weight values to decay towards zero, unless supported by the data
AdamW: Adam with weight decay (to decouple the weight decay from the l2 regularisation)
Regularisation: residual connection
Training networks with a large number of layers is difficult
Shattered gradients:
Residual connections:
Loss landscape with/wothout residual connections
a) visualisation of the error surface for a network with 56 layers
b) the same network with the inclusion of residual connections
Regularisation:dropout
Dropout
is one of the most effective forms of regularisation and is widely used in applications
We delete nodes from the network, including their connections, at random during training
Applied to hidden and input nodes, but not outputs
Language Models
A language model assigns a probability to a sequence of words such that
Given the observed training text, how probable is this new utterance?
Can compare different orderings of words (e.g. translations) or choice of words (e.g. speech recognition)
Employ the chain rule to decompose the joint probability into a sequence of conditional probabilities:
Can model complex joint distributions by learning conditional distributions over the next word (xn) given the history of words observed (x1, x2, ..., xn)
The simple objective of modelling the next word given the observed history contains much of the complexity of natural language understanding. This is also called Next Token Prediction (NTP).
Consider predicting the extension of the utterance: p(.|There she built a)
With more context, we are able to use our knowledge of language and the world to heavily constrain the distribution over the next word: p(.| Alice went to the beach. There she built a)
Language modelling is a time series prediction problem in which we must be careful to train on the past and test on the future.
Recurrent Neural Networks
RNNs
The unrolled recurrent network is a direct acyclic graph
Unrolled step-by-step
Could we build an RNN for the next word prediction p(.| There she built a )
Prepend the senetence with a start symbol <s>
Formulate the input sequence as: (<s>, There, she, built, a)
Forward pass
We can run forward equations as usual, and run the error function, which is the mean of all the errors in this case, where N is the sequence length
Back-Propagation Through Time
The forward equations means that the error, as above, is
Back propagation (red edges) gives gradients for parameter update:
The gradient update is called
Back-Propagation Throigh Time (BPTT)
: the derivatives at timestep n depend on those at future timesteps.
If we break dependencies after a fixed number of timesteps we get
Truncated BPTT
Exploding and Vanishing Gradient
An RNNs Model needs to discover and represent long-range dependencies
While an RNN model can represent these dependencies in theory, can it learn them?
Consider the path of partial derivatives linking a change in En to changes in h1:
Assume V decomposes into Vx and Vh and g is an element-wise activation:
Calculate the gradients:
For the repeated multiplication of Vh, if the largest eigenvalue of Vh is
1, the gradient can be propagated
greater than 1, the product will grow exponentially (explode)
< 1, the product shrinks exponentially (vanishes)
The most popular solution: to change the network architecture to include gated units.
Transformers
Issues with recurrent neural networks
Lack of parallelisability: forward and backward passes have O(sequence length) unparallelizable operations
GPUs can perform a bunch of independent computations at once
But future RNN hidden states can't be computed in full before past RNN hidden states have been computed, which inhibits training on very large datasets
Attention
Consider the following sentences:
I swam across the river to get to the other back
I walked across the road to get cash from the bank
Task: to determine the appropriate interpretation of 'bank'
Attention
: a neural network should attend to/more heavily rely on specific words from the rest of the sequence.
Input representation: word embedding
How to convert the words into a numerical representation?
One-hot encoding
Define a fixed dictionary and introduce vectors of length equal to dictionary size
To encode the k-th word with the vector x_n, having 1 in position k and 0 elsewhere
Results in vectors of very high dimensionality if the dictionary is large
Word embedding
Defined a matrix E of size D x K: D the dimensionality of the embedding space, K the dimensionality of the dictionary: v_n = Ex_n, where vector v_n is the corresponding column of the matrix E (can be learned, e.g. word2vec)
Learned embedding space often has an even richer semantic structure
As a pre-processing step: part of the overall end-to-end training
One sentence after embedding:
A set of vectrs {x_n} of dimensionality D, where n = 1, ..., N
Vectors are tokens, which might correspond to a word or byte pair
To combine data vectors into a matrix X of dimensionas N x D:
nth row comprises the token vector
x
^T_n
This matrix represents one set of input tokens
For most applications, we will require a data set containing many sets of tokens
Self-attention
Task: to design such input transformation that:
Takes
X
as input and outpust a transformed matrix
Y
of the same dimensionality,
Y
= TransformedLayer[
X
]
Can apply multiple times in succession to construct deep networks
Question: Given a set of tokens x1, ..., xN in an embedding space, how to map it to another set y1, ..., yN having the same token number but in a new embedding space?