Machine Learning Technologies
Basic ML
Regression
Linear Regression
Simple Linear Regression
Multiple Linear Regression
Regularisation
L1: Lasso Regularisation
Adds a penalty equivalent to the sum of the absolute values of the coefficients.
L2: Ridge Regularisation
Adds a penalty equivalent to the sum of the squares of the coefficients.
Squaring keeps every penalty term non-negative
Elastic Net (L1 + L2)
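A minimal sketch comparing the three penalties, assuming scikit-learn is available; the alpha/l1_ratio values are arbitrary illustrations, not recommendations.
```python
# Sketch: L1 (Lasso), L2 (Ridge), and Elastic Net penalties on toy data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)                      # L1: drives some coefficients to exactly zero
ridge = Ridge(alpha=0.1).fit(X, y)                      # L2: shrinks coefficients towards zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print("Lasso coefficients:      ", lasso.coef_)
print("Ridge coefficients:      ", ridge.coef_)
print("Elastic Net coefficients:", enet.coef_)
```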
Classification
Generative
Full Bayes
Naive Bayes
Maximum Likelihood Estimation
Discriminative
Logistic Regression
Multi-class Classification
Performance Measures
Bias vs Variance
Bias
The difference between the model's average prediction and the true value; high bias means the model underfits.
Variance
How much the model's predictions change across different training sets; a large gap between training and test performance indicates high variance.
Intermediate ML
SVMs
Non-Linear
Linear
Ensemble Models
Decision Trees
Gini Index/Score
Measures the class impurity of a node.
A pure node (Gini = 0) is one where all instances belong to the same class.
Algorithm
Choose the split (feature and threshold) that minimises the weighted sum of the children's Gini scores (see the impurity sketch below).
Repeat recursively until each node is a leaf (pure, or a stopping criterion is reached).
Entropy/Shannon Entropy
The average number of yes/no questions you need to ask to identify an instance's class; higher entropy means higher impurity.
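A plain NumPy sketch, illustrative only, of the Gini and entropy impurity measures and of scoring a candidate split by the weighted sum of the children's impurities.
```python
# Sketch: node impurity measures and a weighted split score.
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions (0 for a pure node).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits: average number of yes/no questions to identify the class.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_split_score(left_labels, right_labels, impurity=gini):
    # Weighted sum of the children's impurity: the quantity the split search minimises.
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * impurity(left_labels) + \
           (len(right_labels) / n) * impurity(right_labels)

labels = np.array([0, 0, 1, 1, 1, 1])
print(gini(labels), entropy(labels))
print(weighted_split_score(labels[:2], labels[2:]))  # a perfect split scores 0
```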
Evaluation
Advantages
Less effort for data prep (no normalisation/scaling)
Whitebox model so inherently interpretable
Can easily achieve 0% error rate on training data
Disadvantages
High training time
High variance -> overfitting
Instability -> sensitivity to variation in dataset
Regularisation Hyperparameters
Max tree depth
Min samples for node before split
Min samples a leaf node must have
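A hedged scikit-learn sketch of these three hyperparameters (assumes scikit-learn; the values shown are arbitrary examples).
```python
# Sketch: regularising a decision tree via its hyperparameters.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    max_depth=3,           # max tree depth
    min_samples_split=10,  # min samples a node needs before it may be split
    min_samples_leaf=5,    # min samples every leaf node must keep
)
tree.fit(X, y)
print("Training accuracy:", tree.score(X, y))
```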
Bagging
Bootstrap Aggregating
Motivation
A complex model tends to have large variance and overfit the data.
We can potentially average complex models to reduce variance.
Random Forest
Forest
Train a group of decision trees and aggregate their predictions by majority vote (classification) or averaging (regression).
Random
Random training sets
Feature randomness
e.g. only k = sqrt(n) of the n features are considered at each split
Trees learn to be diverse (decorrelated)
Decreases training time
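A minimal scikit-learn sketch of a random forest combining bootstrapped training sets with feature randomness; assumes scikit-learn, and the hyperparameter values are illustrative.
```python
# Sketch: random forest = bagged trees + random feature subsets per split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,     # number of trees whose votes are aggregated
    max_features="sqrt",  # feature randomness: consider only sqrt(n) features at each split
    bootstrap=True,       # each tree sees a random training set drawn with replacement
    random_state=0,
)
forest.fit(X, y)
print("Training accuracy:", forest.score(X, y))
```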
Boosting
Comparison with Random Forest
Tree Size
RF = random, B = stumps (one root, two leaves)
Voting
RF = equal vote, B = weighted vote
Tree Creation
RF = independent, B = ordered (sequential)
Guarantee
If your ML algorithm can produce a classifier with error rate smaller than 50% on training data, you can obtain ~0% error rate classifier after boosting.
It is important each classifier is complementary, not similar to those before.
How to obtain different classifiers?
Train on different training datasets:
Resample
Reweight
Adaboost
Adaptive Boosting
Concept
Train classifier n+1 with emphasis on the instances that classifier n misclassified (see the AdaBoost sketch below).
The emphasis doesn't have to be a hard selection; instead, the same instances can be reused, reweighted by how badly they were misclassified.
Aggregation Functions
Uniform Weight
Non-uniform Weight
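A hedged scikit-learn sketch of AdaBoost with decision stumps and a weighted vote; assumes scikit-learn (the keyword is estimator in recent versions, base_estimator in older ones), and the hyperparameters are illustrative.
```python
# Sketch: AdaBoost over decision stumps.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1)   # one root, two leaves
boost = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=0)
boost.fit(X, y)
print("Training accuracy:", boost.score(X, y))
```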
Bootstrapping
A resampling method that draws several samples with replacement.
Randomly emphasises particular items in the dataset.
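A minimal NumPy sketch of bootstrapping: drawing with replacement repeats some items (emphasising them) and leaves others out.
```python
# Sketch: one bootstrap sample drawn with replacement.
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)                  # some items appear more than once, some not at all
print(np.unique(bootstrap_sample).size)  # typically around 63% of the original items
```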
Dimensionality Reduction
Curse of Dimensionality
PCA
Advanced ML
Neural Networks
Limitations of Logistic Regression
Only capable of separating linearly separable data.
Can use a feature transformation but these can be difficult to find
Cascading multiple logistic regressions automatically incorporates both feature transformation and classification
Composition
Neurons
Components
Set of inputs
Input weights
Bias
Activation function applied to the weighted sum of the inputs plus the bias.
The individual units of a neural network, e.g. a perceptron or a logistic regression unit.
3 Sections
Input Layer
n hidden Layers
Output Layer
Activation Functions
Sigmoid
ReLU
Rectified Linear Unit
Fast to compute
Mitigates the vanishing gradient problem
Maxout
Combines 2 or more neurons and outputs the max of their outputs.
ReLU is technically a special case of maxout where one of the contributing linear units has its weights and bias fixed at zero.
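Illustrative NumPy definitions of the activation functions above, including ReLU expressed as a maxout over one linear piece and a zero piece.
```python
# Sketch: sigmoid, ReLU, and maxout activations.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def maxout(z):
    # z: pre-activations from 2+ grouped linear units; output their element-wise max.
    return np.max(z, axis=-1)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x))
# ReLU as a maxout special case: max(w.x + b, 0), i.e. one piece fixed at zero.
print(maxout(np.stack([x, np.zeros_like(x)], axis=-1)))
```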
Architectures
Fully Connected Feed-Forward Networks
(FFNNs)
Each neuron in layer n outputs to every neuron in layer n+1
Multi-class Classification: Softmax
The output probability for each class is the exponential of its input divided by the sum of the exponentials of all inputs.
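A minimal NumPy softmax sketch; subtracting the maximum logit is a standard numerical-stability trick, not part of the definition.
```python
# Sketch: softmax over a vector of logits.
import numpy as np

def softmax(logits):
    exp = np.exp(logits - np.max(logits))  # stability shift; does not change the result
    return exp / exp.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```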
Drop Out
During training, each neuron has a probability p of dropping out.
In testing, each weight is multiplied by 1-p.
This compensates for the fact that, during training, each neuron received input from only a fraction (1-p) of the previous layer's neurons on average.
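A NumPy sketch of exactly the scheme described above (drop with probability p during training, scale by 1-p at test time); illustrative only.
```python
# Sketch: dropout at training time vs. test-time scaling.
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                          # dropout probability

def dropout_train(activations, p):
    mask = rng.random(activations.shape) >= p    # keep each unit with probability 1 - p
    return activations * mask

def dropout_test(activations, p):
    return activations * (1.0 - p)               # compensate for training-time dropping

a = np.ones(8)
print(dropout_train(a, p))
print(dropout_test(a, p))
```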
Loss Functions
Categorical Cross Entropy
The negative sum over classes of target output x log(predicted output)
Total Loss
The sum of the cross entropy losses for every item in the training data.
The idea of learning is to minimise total loss.
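A NumPy sketch of categorical cross-entropy for one item and the total loss summed over a (toy) training set.
```python
# Sketch: categorical cross-entropy and total loss.
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    # Negative sum over classes of target * log(predicted output).
    return -np.sum(target * np.log(predicted + eps))

targets = np.array([[1, 0, 0], [0, 1, 0]])                  # one-hot labels
predictions = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # network outputs (softmax)

total_loss = sum(cross_entropy(t, p) for t, p in zip(targets, predictions))
print(total_loss)   # learning aims to minimise this quantity
```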
Updating Weights
Gradient Descent
Update network parameters by calculating the gradient of the loss function.
Backpropagation
An efficient way to compute dL/dw
Forward Pass
Compute dz/dw for all parameters
Backward Pass
Compute dC/dz for all activation function inputs z.
Combine dz/dw and dC/dz using chain rule to get dC/dw
dL/dw = the sum of dC/dw over all training examples
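A worked single-neuron example of the chain rule steps above (NumPy sketch; the model z = w*x + b, a = sigmoid(z), C = (a - y)^2 is an assumption chosen for illustration).
```python
# Sketch: forward pass gives dz/dw, backward pass gives dC/dz, chain rule gives dC/dw.
import numpy as np

w, b, x, y = 0.5, 0.1, 2.0, 1.0

# Forward pass: compute z and dz/dw.
z = w * x + b
dz_dw = x
a = 1.0 / (1.0 + np.exp(-z))

# Backward pass: compute dC/dz for the activation input z.
dC_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)
dC_dz = dC_da * da_dz

# Combine dz/dw and dC/dz with the chain rule to get dC/dw.
dC_dw = dC_dz * dz_dw
print(dC_dw)
```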
Alternative Gradient Descent Methods
Stochastic Gradient Descent
In each iteration, update the network parameters using the gradient of the loss calculated on a single, randomly selected data point.
Runs significantly faster than normal gradient descent, so is the dominant strategy.
Adam Optimiser is an alternative that combines the best properties of AdaGrad and RMSProp
Upsides
Frequent insights into performance and rate of improvement.
Simplest to understand and implement
Faster learning on SOME problems
Noisy update process CAN allow the model to avoid local minima
Downsides
Longer to train on large datasets due to frequent updates
Noisy gradient signal causing model error to jump around
Can be hard to settle on error minimum
Batch Gradient Descent
Calculates the error for each training instance, but only updates the model after all training examples have been evaluated.
Upsides
Fewer updates is more efficient
More stable error gradient
Better for parallel processing
Downsides
Premature convergence
Additional complexity in accumulating errors across training examples
Some implementations require the entire training dataset in memory
Model updates and training speed may become very slow for large datasets
Mini-Batch Gradient Descent
Splits the training set into mini-batches and updates the weights after each mini-batch (see the sketch below).
Upsides
More frequent model updates allows for more robust convergence than batch
Batched updates more efficient than stochastic gradient descent
Doesn't require all training data in memory at once
Downsides
Mini-batch requires the configuration of an additional mini-batch size hyperparameter
Error information must be accumulated across mini-batches like batch gradient descent.
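A minimal mini-batch gradient descent loop on linear regression (NumPy sketch; batch size and learning rate are arbitrary). Batch gradient descent is the special case batch_size = len(X), stochastic gradient descent is batch_size = 1.
```python
# Sketch: mini-batch gradient descent for least-squares linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=256)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(50):
    order = rng.permutation(len(X))                    # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of mean squared error
        w -= lr * grad                                 # update after every mini-batch
print(w)   # should be close to true_w
```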
Problems
Vanishing Gradients
Problem
As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train.
Cause
Certain activation functions, like the sigmoid function, squash a large input space into the small range (0, 1). Therefore, a large change in the input of the sigmoid function causes only a small change in the output, so the derivative becomes small.
E.g., for inputs with |x| > 5, the sigmoid's derivative is approximately 0.
Significance
Backpropagation multiplies derivatives layer by layer from last to first.
When using activation functions like sigmoid, this process multiplies many small derivatives together.
Thus, the gradient decreases exponentially (see the numeric sketch below).
A small gradient means that the weights and biases of the initial layers will not be updated effectively.
Solution
Activation functions such as ReLU which don't cause a small derivative.
Residual networks which provide residual connections straight to earlier layers, skipping squashing activation functions.
Batch normalisation layers
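A tiny numeric illustration of the effect described above: the sigmoid derivative is at most 0.25, so multiplying it across many layers shrinks the gradient exponentially (NumPy sketch, illustrative only).
```python
# Sketch: product of sigmoid derivatives across increasing depth.
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = 0.0                                             # where the sigmoid derivative is largest (0.25)
for depth in (5, 10, 20):
    print(depth, sigmoid_derivative(x) ** depth)    # 0.25**depth -> effectively zero
```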
Overfitting
Solutions
Dropout
At each training step, every neuron (excluding output layer) has a probability p of being ignored within that training step.
p=0.5 typically
Early Stopping
Train on a mini-batch, evaluate on the validation set, then continue training on the next batch. Stop training as soon as the error rate on the validation set starts increasing (performance starts dropping).
Data Augmentation
Artificially generate new data points from existing ones. Add some noise to existing data but keep the same label.
Regularisation
Theories
Modularisation
A deep network enables modularisation, where the network approximates a set of simpler functions and combines them.
A shallow network approximates a single, extremely complex function.
Universality Theorem
Any continuous function that maps reals to reals can be realised by a network with one hidden layer, given enough neurons.
(However deeper networks are more effective).
CNNs
Layers
A CNN must start with a convolutional layer and end with an FC layer, but can have any combination of convolutional and pooling layers in between.
Convolutional Layer
Components
Input data
Feature detector (kernel, template, filter)
Feature map
Activation function (typically ReLU)
Feature Detector
A 2D array of weights which are convolved across the image
Outputs a feature map
Fixed kernel weights = parameter sharing
Feature Map
aka activation map, convolved feature
Output array of the feature detector
Local connectivity = each value in the feature map does not have to connect to each pixel value in the input image
Hyperparameters
Number of Filters
This affects the depth of the output, e.g. 3 filters would yield 3 feature maps
Stride
The distance that the kernel moves over the input matrix. A larger stride yields a smaller output
Zero Padding
Padding is used when the filters do not fit the input image.
Valid padding = no padding; the last convolution is dropped if the dimensions don't align
Same padding = Ensures output is same dimension as input
Full padding = Increases output size by adding zeros to the border of input
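An illustrative NumPy 2D convolution (strictly cross-correlation, as in most CNN libraries) showing how stride and zero padding change the feature-map size; the function and values are assumptions for illustration.
```python
# Sketch: 2D convolution with stride and zero padding.
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    if padding:
        image = np.pad(image, padding)                  # zero padding around the border
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)          # shared (fixed) kernel weights
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0
print(conv2d(image, kernel).shape)                 # valid padding: (3, 3)
print(conv2d(image, kernel, padding=1).shape)      # same padding:  (5, 5)
print(conv2d(image, kernel, stride=2).shape)       # larger stride: (2, 2)
```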
Pooling Layer
Benefits
Reduces complexity
Improves efficiency
Limits risk of overfitting
Downsides
Information loss
Description
Dimensionality reduction layers which reduce the number of parameters in the input.
Types
Max Pooling: Selects the pixel with max value in filter
Avg Pooling: Outputs average value of receptive field
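A minimal NumPy sketch of max and average pooling over a feature map, with an assumed 2x2 window and stride 2.
```python
# Sketch: max pooling and average pooling.
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)   # keep the max (or average) of each window
    return out

feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(feature_map, mode="max"))
print(pool2d(feature_map, mode="avg"))
```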
Fully-Connected (FC) Layer
Performs classification based on the features extracted through the previous layers and their different filters. Tend to use softmax activation instead of ReLU
Skip Connections
(shortcut connections)
The signal feeding into a layer is also added to the output of a layer located much higher up the stack
RNNs
Concept
Basic Idea
The outputs of hidden layers are stored in memory and fed back as inputs at the next time step.
Loops are allowed!
Learning
Backpropagation through time (BPTT)
Problems
Exploding Gradients
Solved by gradient clipping
Long-Term Dependencies
LSTMs
LSTM Variants
Meta ML
ML Lifecycle
Main Challenges in ML