DL 2019 Exam (Math symbols (f = y = predicted label, h = activation after…

- - - - makes sure all class labels, f_j, can be predicted and that only the given class labels can result from this function
    - - any negative log-likelihood loss is a cross entropy between the empirical/data distribution and the model distribution
  - - - Gradient descent moves along the negative gradient
    - - Updates weights based on this gradient
      - learning rate
        
        small -> too slow
        
        large: oscillate and can diverge
        
        momentum method (Polyak 1964)
        
        a ball rolling on the error surface; its mass lets it accelerate and decelerate
        
        damps oscillations in high curvature and speeds up with consistent gradient
      - Backpropagation training
        
        Chain rule of derivatives
  - - - does not waste computation on redundant data
        
        gradient estimates from mini-batches introduce noise:
        stochastic gradient descent (SGD)
        
        may have regularization effect
        
        harder to end up in a sharp local optimum, which tends to correspond to an overfitted solution
      - Practical considerations
        
        anneal the learning rate towards the end of training to reduce the effect of noise
        
        Classes need to be balanced
        
        Shuffling data to avoid bias toward any mini-batch split
        
        Larger batch size reduces noise in the gradient estimates
    - - Shift the mean to zero and scale to unit variance
      - more thorough method using PCA
    - - different init values for each neuron make them react differently to input and learn to detect different features from data
      - Impact of weights on output variance
        
        variance can grow too quickly if the weights are too large
        
        variance can decay if the weights are too small
        
        For RNNs, the problem becomes exponential because the weights are shared across time steps
    - - Shift and scale the inputs of an intermediate layer using
        the mean and std computed from each mini-batch
        
        two extra parameters control the mean and
        variance of the layer's inputs
    - - Use momentum
      - Divide the gradient by a running average
        of its recent magnitude (RMSProp)
        
        Adapt learning rate for each parameter
        
        Escapes from plateaus of tiny gradients
      - Combine both (Adam)
  - - - Flat local minima for better generalization
      - Avoid plateaus where outputs become independent of inputs
    - - Fixed by weight init and batch normalization
    - - Fixed by skip connections (ResNet)
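The update rules named in the optimization branch above (momentum, RMSProp, Adam) can be sketched in plain Python. This is a minimal illustration rather than code from the course: the function names, the hyperparameter defaults, and the one-dimensional toy objective f(w) = (w - 3)^2 are all assumptions chosen for the example.

```python
import math


def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """Momentum: the velocity v accumulates past gradients (the 'ball')."""
    v = beta * v - lr * grad
    return w + v, v


def rmsprop_step(w, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: divide the gradient by a running average of its magnitude."""
    s = rho * s + (1.0 - rho) * grad * grad
    return w - lr * grad / (math.sqrt(s) + eps), s


def adam_step(w, m, s, grad, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style average m combined with RMSProp-style scaling s."""
    m = beta1 * m + (1.0 - beta1) * grad
    s = beta2 * s + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)  # bias correction, t counts from 1
    s_hat = s / (1.0 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(s_hat) + eps), m, s


def minimize_quadratic(steps=500):
    """Run momentum and Adam on the toy objective f(w) = (w - 3)^2."""
    def grad(w):
        return 2.0 * (w - 3.0)

    w_mom, v = 0.0, 0.0
    w_adam, m, s = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        w_mom, v = momentum_step(w_mom, v, grad(w_mom))
        w_adam, m, s = adam_step(w_adam, m, s, grad(w_adam), t, lr=0.05)
    return w_mom, w_adam


if __name__ == "__main__":
    print(minimize_quadratic())
```

Both runs should end close to the minimum at w = 3. In a real network the same rules are applied per parameter, with `grad` coming from backpropagation on a mini-batch.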
- - - - Too many parameters considering the data
    - - Use validation set to evaluate performance
  - - - Stop when validation set loss starts to deteriorate
      - Keeps solution close to initialization
    - - Reduce network size
        
        Rule of thumb: to keep the risk of overfitting low, the number of examples
        should be about ten times larger than the number of parameters.
      - Parameter sharing
        
        CNNs and RNNs
      - Weight decay
        
        Penalty term to training cost
        
        L2 (Ridge) regularization
        
        Equivalent to Gaussian prior
        
        L1 (Lasso) regularization
        
        induces sparsity -> many params go to zero
        
        Equivalent to Laplace prior
      - Sparse representations
    - - Data augmentation
        
        Generate more labelled data by transformations
        
        Classification network also learns to be invariant to such transformations
      - Injecting noise
        
        Similar to data augmentation, learns resilience against such noise
      - Adversarial training
        
        GANs
        
        How small a change in the input can produce a different output?
      - Auxiliary tasks
    - - Dropout
        
        At training, randomly delete each hidden node with probability p
        
        Injecting noise
        
        training ensemble of models with shared weights
      - Probabilistic treatment
        
        Bayesian methodology
        
        Combine predictions of all possible models, weighting them by model evidence
        
        Within a selected family of distributions, find the member that most closely approximates the true posterior distribution
        
        This effectively reduces the parameters to the mean and variance of the selected distribution family
        
        Related to minimizing Kullback-Leibler (KL) divergence
        
        Sampling
        
        Draw samples (w) from posterior distribution or approximation
        
        Each sample defines one NN
        
        Creates ensemble of NNs
        
        Stein operator to compute gradient of KL divergence
      - Bagging/model averaging
        
        Train several models -> take average of their outputs
  - - - Need for implicit regularization techniques such as batch normalization and stochastic gradient descent
  - - - Grid search
        
        Evaluates all combinations
        
        Unnecessary and computationally too expensive
      - Random search
        
        Random combinations
        
        More convenient and faster than grid search, though it does not necessarily find the best combination
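The contrast between grid search and random search can be sketched as follows. This is a hedged illustration: `toy_score` is a hypothetical stand-in for a validation score (higher is better), and the hyperparameter names `lr` and `dropout`, their ranges, and the helper names are assumptions made for the example.

```python
import itertools
import math
import random


def toy_score(params):
    """Hypothetical validation score; peaks at lr = 1e-2, dropout = 0.3."""
    return -(math.log10(params["lr"]) + 2.0) ** 2 - (params["dropout"] - 0.3) ** 2


def grid_search(score, grid):
    """Evaluate every combination of the listed values."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        s = score(params)
        if best is None or s > best[0]:
            best = (s, params)
    return best


def random_search(score, samplers, n_trials, seed=0):
    """Evaluate n_trials randomly drawn combinations."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: sample(rng) for name, sample in samplers.items()}
        s = score(params)
        if best is None or s > best[0]:
            best = (s, params)
    return best


if __name__ == "__main__":
    grid = {"lr": [1e-3, 1e-2, 1e-1], "dropout": [0.1, 0.3, 0.5]}
    samplers = {
        "lr": lambda rng: 10 ** rng.uniform(-4, 0),  # log-uniform draw
        "dropout": lambda rng: rng.uniform(0.0, 0.6),
    }
    print(grid_search(toy_score, grid))
    print(random_search(toy_score, samplers, n_trials=9))
```

The grid here costs 3 x 3 = 9 evaluations, and that count multiplies with every added hyperparameter. Random search keeps the budget fixed at `n_trials` whatever the number of hyperparameters, at the price of no guarantee of hitting the best combination.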
- - - - “Every time I fire a linguist, the performance of the speech recognizer goes up”
    - - Able to use domain knowledge to build rules which help narrow down the problem
  - - - Each layer learns more complex features
    - - Labeled datasets too small
      - Computers too slow
      - Wrong way to initialize weights
      - Wrong type of non-linearity
  - - - linear binary classifier
        
        taught how to use weights and biases with some function
    - - associative memory
        
        positively correlated weights increase/decrease together
    - - classification machine with weight updates
        
        Approximating conditional logic and therefore simple functions
    - - Limitations of Perceptron
        
        Led to multilayer neural nets/perceptrons (MLP) to solve more complex problems
    - - Recurrent neural network
        
        Minimize energy function to fit data into predetermined patterns
    - - stochastic Hopfield network
    - - topographical mapping, use in data visualization
    - - Implemented on computers by Seppo Linnainmaa (1970)
      - Werbos (1982) and later popularized by Rumelhart, Hinton & Williams (1986) for training deep NN
    - - neocognitron by Fukushima (1980)
      - LeCun (1989): CNN to classify handwritten characters
      - Waibel (1989): time-delay NN applied to audio with a moving window
    - - sequential data processing, but difficult to train with backpropagation
        
        vanishing/exploding gradients
    - - Hochreiter and Schmidhuber (1997) designed to solve the vanishing/exploding gradient problem of RNNs
    - - Hinton (2006): deep networks work better with pretrained layers
        
        Restricted Boltzmann machine for training each layer
    - - AlexNet (2012) with deep CNN
      - Residual net (ResNet) allowed networks to go deeper