Machine Learning Technologies
Basic ML
Regression
Linear Regression
Simple Linear Regression
Multiple Linear Regression
Regularisation
L1: Lasso Regularisation
Adds a penalty equivalent to the sum of the absolute values of the coefficients.
L2: Ridge Regularisation
Adds a penalty equivalent to the sum of the squares of the coefficients.
Squaring keeps every penalty term non-negative
Elastic Net (L1 + L2)
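A minimal sketch comparing the three penalties, assuming scikit-learn is available; the alpha/l1_ratio values are arbitrary illustrations, not recommendations.
```python
# Sketch: L1 (Lasso), L2 (Ridge), and Elastic Net penalties on toy data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)                      # L1: drives some coefficients to exactly zero
ridge = Ridge(alpha=0.1).fit(X, y)                      # L2: shrinks coefficients towards zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print("Lasso coefficients:      ", lasso.coef_)
print("Ridge coefficients:      ", ridge.coef_)
print("Elastic Net coefficients:", enet.coef_)
```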
Classification
Generative
Full Bayes
Naive Bayes
Maximum Likelihood Estimation
Discriminative
Logistic Regression
Multi-class Classification
Performance Measures
Bias vs Variance
Bias
The difference between the model's average prediction and the true value; high bias means the model underfits.
Variance
How much the model's predictions change across different training sets; a large gap between training and test performance indicates high variance.
Intermediate ML
SVMs
Non-Linear
Linear
Ensemble Models
Decision Trees
Gini Index/Score
Measures the class impurity of a node.
A pure node (Gini = 0) is one where all instances belong to the same class.
Algorithm
Choose the split (feature and threshold) that minimises the weighted sum of the children's Gini scores (see the impurity sketch below).
Repeat recursively until each node is a leaf (pure, or a stopping criterion is reached).
Entropy/Shannon Entropy
The average number of yes/no questions you need to ask to identify an instance's class; higher entropy means higher impurity.
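A plain NumPy sketch, illustrative only, of the Gini and entropy impurity measures and of scoring a candidate split by the weighted sum of the children's impurities.
```python
# Sketch: node impurity measures and a weighted split score.
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions (0 for a pure node).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits: average number of yes/no questions to identify the class.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def weighted_split_score(left_labels, right_labels, impurity=gini):
    # Weighted sum of the children's impurity: the quantity the split search minimises.
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * impurity(left_labels) + \
           (len(right_labels) / n) * impurity(right_labels)

labels = np.array([0, 0, 1, 1, 1, 1])
print(gini(labels), entropy(labels))
print(weighted_split_score(labels[:2], labels[2:]))  # a perfect split scores 0
```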
Evaluation
Advantages
Less effort for data prep (no normalisation/scaling)
Whitebox model so inherently interpretable
Can easily achieve 0% error rate on training data
Disadvantages
High training time
High variance -> overfitting
Instability -> sensitivity to variation in dataset
Regularisation Hyperparameters
Max tree depth
Min samples for node before split
Min samples a leaf node must have
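A hedged scikit-learn sketch of these three hyperparameters (assumes scikit-learn; the values shown are arbitrary examples).
```python
# Sketch: regularising a decision tree via its hyperparameters.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    max_depth=3,           # max tree depth
    min_samples_split=10,  # min samples a node needs before it may be split
    min_samples_leaf=5,    # min samples every leaf node must keep
)
tree.fit(X, y)
print("Training accuracy:", tree.score(X, y))
```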
Bagging
Bootstrap Aggregating
Motivation
A complex model tends to have large variance and overfit the data.
We can potentially average complex models to reduce variance.
Random Forest
Forest
Train a group of decision trees and aggregate their predictions by majority vote (classification) or averaging (regression).
Random
Random training sets
Feature randomness
e.g. only k = sqrt(n) of the n features are considered at each split
Trees learn to be diverse (decorrelated)
Decreases training time
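A minimal scikit-learn sketch of a random forest combining bootstrapped training sets with feature randomness; assumes scikit-learn, and the hyperparameter values are illustrative.
```python
# Sketch: random forest = bagged trees + random feature subsets per split.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=100,     # number of trees whose votes are aggregated
    max_features="sqrt",  # feature randomness: consider only sqrt(n) features at each split
    bootstrap=True,       # each tree sees a random training set drawn with replacement
    random_state=0,
)
forest.fit(X, y)
print("Training accuracy:", forest.score(X, y))
```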
Boosting
Comparison with Random Forest
Tree Size
RF = random, B = stumps (one root, two leaves)
Voting
RF = equal vote, B = weighted vote
Tree Creation
RF = independent, B = ordered (sequential)
Guarantee
If your ML algorithm can produce a classifier with error rate smaller than 50% on training data, you can obtain ~0% error rate classifier after boosting.
It is important each classifier is complementary, not similar to those before.
How to obtain different classifiers?
Train on different training datasets:
Resample
Reweight
Adaboost
Adaptive Boosting
Concept
Train classifier n+1 with emphasis on the instances that classifier n misclassified (see the AdaBoost sketch below).
The emphasis doesn't have to be a hard selection; instead, the same instances can be reused, reweighted by how badly they were misclassified.
Aggregation Functions
Uniform Weight
Non-uniform Weight
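A hedged scikit-learn sketch of AdaBoost with decision stumps and a weighted vote; assumes scikit-learn (the keyword is estimator in recent versions, base_estimator in older ones), and the hyperparameters are illustrative.
```python
# Sketch: AdaBoost over decision stumps.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1)   # one root, two leaves
boost = AdaBoostClassifier(estimator=stump, n_estimators=50, random_state=0)
boost.fit(X, y)
print("Training accuracy:", boost.score(X, y))
```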
Bootstrapping
A resampling method that draws several samples with replacement.
Randomly emphasises particular items in the dataset.
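A minimal NumPy sketch of bootstrapping: drawing with replacement repeats some items (emphasising them) and leaves others out.
```python
# Sketch: one bootstrap sample drawn with replacement.
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)                  # some items appear more than once, some not at all
print(np.unique(bootstrap_sample).size)  # typically around 63% of the original items
```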
Dimensionality Reduction
Curse of Dimensionality
PCA
Advanced ML
Neural Networks
Limitations of Logistic Regression
Only capable of separating linearly separable data.
Can use a feature transformation but these can be difficult to find
Cascading multiple logistic regressions automatically incorporates both feature transformation and classification
Composition
Neurons
Components
Set of inputs
Input weights
Bias
Activation function applied to the weighted sum of the inputs plus the bias.
The individual units of a neural network, e.g. a perceptron or a logistic regression unit.
3 Sections
Input Layer
n hidden Layers
Output Layer
Activation Functions
Sigmoid
ReLU
Rectified Linear Unit
Fast to compute
Mitigates the vanishing gradient problem
Maxout
Combines 2 or more neurons and outputs the max of their outputs.
ReLU is technically a special case of maxout where one of the contributing linear units has its weights and bias fixed at zero.
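Illustrative NumPy definitions of the activation functions above, including ReLU expressed as a maxout over one linear piece and a zero piece.
```python
# Sketch: sigmoid, ReLU, and maxout activations.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def maxout(z):
    # z: pre-activations from 2+ grouped linear units; output their element-wise max.
    return np.max(z, axis=-1)

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x))
# ReLU as a maxout special case: max(w.x + b, 0), i.e. one piece fixed at zero.
print(maxout(np.stack([x, np.zeros_like(x)], axis=-1)))
```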
Architectures
Fully Connected Feed-Forward Networks
(FFNNs)
Each neuron in layer n outputs to every neuron in layer n+1
Multi-class Classification: Softmax
The output probability for each class is the exponential of its input divided by the sum of the exponentials of all inputs.
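A minimal NumPy softmax sketch; subtracting the maximum logit is a standard numerical-stability trick, not part of the definition.
```python
# Sketch: softmax over a vector of logits.
import numpy as np

def softmax(logits):
    exp = np.exp(logits - np.max(logits))  # stability shift; does not change the result
    return exp / exp.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```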
Drop Out
During training, each neuron has a probability p of dropping out.
In testing, each weight is multiplied by 1-p.
This compensates for the fact that, during training, each neuron received input from only a fraction (1-p) of the previous layer's neurons on average.
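A NumPy sketch of exactly the scheme described above (drop with probability p during training, scale by 1-p at test time); illustrative only.
```python
# Sketch: dropout at training time vs. test-time scaling.
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                          # dropout probability

def dropout_train(activations, p):
    mask = rng.random(activations.shape) >= p    # keep each unit with probability 1 - p
    return activations * mask

def dropout_test(activations, p):
    return activations * (1.0 - p)               # compensate for training-time dropping

a = np.ones(8)
print(dropout_train(a, p))
print(dropout_test(a, p))
```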
Loss Functions
Categorical Cross Entropy
The negative sum over classes of target output x log(predicted output)
Total Loss
The sum of the cross entropy losses for every item in the training data.
The idea of learning is to minimise total loss.
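A NumPy sketch of categorical cross-entropy for one item and the total loss summed over a (toy) training set.
```python
# Sketch: categorical cross-entropy and total loss.
import numpy as np

def cross_entropy(target, predicted, eps=1e-12):
    # Negative sum over classes of target * log(predicted output).
    return -np.sum(target * np.log(predicted + eps))

targets = np.array([[1, 0, 0], [0, 1, 0]])                  # one-hot labels
predictions = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # network outputs (softmax)

total_loss = sum(cross_entropy(t, p) for t, p in zip(targets, predictions))
print(total_loss)   # learning aims to minimise this quantity
```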
Updating Weights
Gradient Descent
Update network parameters by calculating the gradient of the loss function.
Backpropagation
An efficient way to compute dL/dw
Forward Pass
Compute dz/dw for all parameters
Backward Pass
Compute dC/dz for all activation function inputs z.
Combine dz/dw and dC/dz using chain rule to get dC/dw
dL/dw = the sum of dC/dw over all training examples
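A worked single-neuron example of the chain rule steps above (NumPy sketch; the model z = w*x + b, a = sigmoid(z), C = (a - y)^2 is an assumption chosen for illustration).
```python
# Sketch: forward pass gives dz/dw, backward pass gives dC/dz, chain rule gives dC/dw.
import numpy as np

w, b, x, y = 0.5, 0.1, 2.0, 1.0

# Forward pass: compute z and dz/dw.
z = w * x + b
dz_dw = x
a = 1.0 / (1.0 + np.exp(-z))

# Backward pass: compute dC/dz for the activation input z.
dC_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)
dC_dz = dC_da * da_dz

# Combine dz/dw and dC/dz with the chain rule to get dC/dw.
dC_dw = dC_dz * dz_dw
print(dC_dw)
```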
Alternative Gradient Descent Methods
Stochastic Gradient Descent
In each iteration, update the network parameters using the gradient of the loss calculated on a single, randomly selected data point.
Runs significantly faster than normal gradient descent, so is the dominant strategy.
Adam Optimiser is an alternative that combines the best properties of AdaGrad and RMSProp
Upsides
Frequent insights into performance and rate of improvement.
Simplest to understand and implement
Faster learning on SOME problems
Noisy update process CAN allow the model to avoid local minima
Downsides
Longer to train on large datasets due to frequent updates
Noisy gradient signal causing model error to jump around
Can be hard to settle on error minimum
Batch Gradient Descent
Calculates the error for each training instance, but only updates the model after all training examples have been evaluated.
Upsides
Fewer updates is more efficient
More stable error gradient
Better for parallel processing
Downsides
Premature convergence
Additional complexity in accumulating errors across training examples
Some implementations require the entire training dataset in memory
Model updates and training speed may become very slow for large datasets
Mini-Batch Gradient Descent
Splits the training set into mini-batches and updates the weights after each mini-batch (see the sketch below).
Upsides
More frequent model updates allows for more robust convergence than batch
Batched updates more efficient than stochastic gradient descent
Doesn't require all training data in memory at once
Downsides
Mini-batch requires the configuration of an additional mini-batch size hyperparameter
Error information must be accumulated across mini-batches like batch gradient descent.
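A minimal mini-batch gradient descent loop on linear regression (NumPy sketch; batch size and learning rate are arbitrary). Batch gradient descent is the special case batch_size = len(X), stochastic gradient descent is batch_size = 1.
```python
# Sketch: mini-batch gradient descent for least-squares linear regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=256)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(50):
    order = rng.permutation(len(X))                    # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient of mean squared error
        w -= lr * grad                                 # update after every mini-batch
print(w)   # should be close to true_w
```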
Problems
Vanishing Gradients
Problem
As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train.
Cause
Certain activation functions, like the sigmoid function, squash a large input space into the small range (0, 1). Therefore, a large change in the input of the sigmoid function causes only a small change in the output, so the derivative becomes small.
E.g., for inputs with |x| > 5, the sigmoid's derivative is approximately 0.
Significance
Backpropagation multiplies derivatives layer by layer from last to first.
When using activation functions like sigmoid, this process multiplies many small derivatives together.
Thus, the gradient decreases exponentially (see the numeric sketch below).
A small gradient means that the weights and biases of the initial layers will not be updated effectively.
Solution
Activation functions such as ReLU which don't cause a small derivative.
Residual networks which provide residual connections straight to earlier layers, skipping squashing activation functions.
Batch normalisation layers
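A tiny numeric illustration of the effect described above: the sigmoid derivative is at most 0.25, so multiplying it across many layers shrinks the gradient exponentially (NumPy sketch, illustrative only).
```python
# Sketch: product of sigmoid derivatives across increasing depth.
import numpy as np

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = 0.0                                             # where the sigmoid derivative is largest (0.25)
for depth in (5, 10, 20):
    print(depth, sigmoid_derivative(x) ** depth)    # 0.25**depth -> effectively zero
```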
Overfitting
Solutions
Dropout
At each training step, every neuron (excluding output layer) has a probability p of being ignored within that training step.
p=0.5 typically
Early Stopping
Train on a mini-batch, evaluate on the validation set, then continue training on the next batch. Stop training as soon as the error rate on the validation set starts increasing (performance starts dropping).
Data Augmentation
Artificially generate new data points from existing ones. Add some noise to existing data but keep the same label.
Regularisation
Theories
Modularisation
A deep network enables modularisation, where the network approximates a set of simpler functions and combines them.
A shallow network approximates a single, extremely complex function.
Universality Theorem
Any continuous function that maps reals to reals can be realised by a network with one hidden layer, given enough neurons.
(However deeper networks are more effective).
CNNs
Layers
A CNN must start with a convolutional layer and end with an FC layer, but can have any combination of convolutional and pooling layers in between.
Convolutional Layer
Components
Input data
Feature detector (kernel, template, filter)
Feature map
Activation function (typically ReLU)
Feature Detector
A 2D array of weights which are convolved across the image
Outputs a feature map
Fixed kernel weights = parameter sharing
Feature Map
aka activation map, convolved feature
Output array of the feature detector
Local connectivity = each value in the feature map does not have to connect to each pixel value in the input image
Hyperparameters
Number of Filters
This affects the depth of the output, e.g. 3 filters would yield 3 feature maps
Stride
The distance that the kernel moves over the input matrix. A larger stride yields a smaller output
Zero Padding
Padding is used when the filters do not fit the input image.
Valid padding = no padding; the last convolution is dropped if the dimensions don't align
Same padding = Ensures output is same dimension as input
Full padding = Increases output size by adding zeros to the border of input
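An illustrative NumPy 2D convolution (strictly cross-correlation, as in most CNN libraries) showing how stride and zero padding change the feature-map size; the function and values are assumptions for illustration.
```python
# Sketch: 2D convolution with stride and zero padding.
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    if padding:
        image = np.pad(image, padding)                  # zero padding around the border
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)          # shared (fixed) kernel weights
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0
print(conv2d(image, kernel).shape)                 # valid padding: (3, 3)
print(conv2d(image, kernel, padding=1).shape)      # same padding:  (5, 5)
print(conv2d(image, kernel, stride=2).shape)       # larger stride: (2, 2)
```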
Pooling Layer
Benefits
Reduces complexity
Improves efficiency
Limits risk of overfitting
Downsides
Information loss
Description
Dimensionality reduction layers which reduce the number of parameters in the input.
Types
Max Pooling: Selects the pixel with max value in filter
Avg Pooling: Outputs average value of receptive field
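A minimal NumPy sketch of max and average pooling over a feature map, with an assumed 2x2 window and stride 2.
```python
# Sketch: max pooling and average pooling.
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    reduce_fn = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce_fn(window)   # keep the max (or average) of each window
    return out

feature_map = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(feature_map, mode="max"))
print(pool2d(feature_map, mode="avg"))
```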
Fully-Connected (FC) Layer
Performs classification based on the features extracted through the previous layers and their different filters. Tend to use softmax activation instead of ReLU
Skip Connections
(shortcut connections)
The signal feeding into a layer is also added to the output of a layer located much higher up the stack
RNNs
Concept
Basic Idea
The outputs of hidden layers are stored in memory and fed back as inputs at the next time step.
Loops are allowed!
Learning
Backpropagation through time (BPTT)
Problems
Exploding Gradients
Solved by gradient clipping
Long-Term Dependencies
LSTMs
LSTM Variants
Meta ML
ML Lifecycle
Main Challenges in ML