Neural Networks - Deep Learning
Standard Architectures
MLP
Vanilla RNN
LSTM - GRU
Conv Neural Networks
VGG
ResNet
Attention Mechanisms
Regularization
Weight Decay
L1 Regularization
: encourages sparse weights (drives many weights exactly to zero)
L2 Regularization
: shrinks weights toward zero without forcing them to exactly zero; the most common choice and what "weight decay" usually refers to (sketch below)
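A minimal sketch of how the L2 penalty enters the loss and the gradient; the lambda value, batch size, and layer shape are illustrative assumptions, not taken from the map.

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam, m):
    """Add the L2 penalty (lam / (2*m)) * sum ||W||^2 to the data loss."""
    l2_penalty = (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return data_loss + l2_penalty

def l2_gradient_term(W, lam, m):
    """Extra gradient contributed by the L2 penalty: (lam / m) * W."""
    return (lam / m) * W

# Illustrative usage with made-up numbers
W1 = np.random.randn(4, 3)
m = 128      # batch size (assumed)
lam = 0.1    # regularization strength (assumed)
total = l2_regularized_loss(data_loss=0.7, weights=[W1], lam=lam, m=m)
dW1 = np.zeros_like(W1) + l2_gradient_term(W1, lam, m)
```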
Dropout
Inverted Dropout implementation: apply a random mask and then divide by keep_prob so the expected activation is preserved in the forward pass (see the sketch below)
The keep probability can be chosen per layer: use a lower keep_prob for layers with larger weight matrices, which are more prone to overfitting
Do not use dropout at test time or while debugging (e.g. during gradient checking)
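A minimal sketch of inverted dropout for a single layer, assuming NumPy activations; keep_prob and the shapes are illustrative.

```python
import numpy as np

def inverted_dropout_forward(a, keep_prob, training=True):
    """Inverted dropout: mask activations and divide by keep_prob so the
    expected value of the activations is unchanged. At test time, do nothing."""
    if not training:
        return a, None
    mask = np.random.rand(*a.shape) < keep_prob
    a_dropped = (a * mask) / keep_prob
    return a_dropped, mask

# Illustrative usage
a3 = np.random.randn(5, 10)               # activations of some hidden layer
a3_train, mask3 = inverted_dropout_forward(a3, keep_prob=0.8, training=True)
a3_test, _ = inverted_dropout_forward(a3, keep_prob=0.8, training=False)
```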
Data Augmentation
Early Stopping
Normalization
Batch Normalization
Allows training deeper networks by reducing internal covariate shift
Normalize the layer's pre-activation z (or the activation a in some variants) the same way we standardize input data, then scale and shift it with two learnable parameters, gamma and beta
At test time BN is not shut down; instead, we keep running estimates of the mean and variance from the training batches with an exponentially weighted average and use them for inference (see the sketch below)
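A minimal sketch of a batch-norm forward pass on the pre-activation z, with learnable gamma/beta and running statistics for test time; the momentum and eps values are assumptions.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    """Normalize z per feature, then scale and shift with gamma and beta.
    During training, update running mean/variance with an exponentially
    weighted average; at test time, use those running estimates."""
    if training:
        mu = z.mean(axis=0)
        var = z.var(axis=0)
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    z_hat = (z - mu) / np.sqrt(var + eps)
    out = gamma * z_hat + beta
    return out, running_mean, running_var

# Illustrative usage: batch of 32 examples, 16 units
z = np.random.randn(32, 16)
gamma, beta = np.ones(16), np.zeros(16)
rm, rv = np.zeros(16), np.ones(16)
out, rm, rv = batchnorm_forward(z, gamma, beta, rm, rv, training=True)
```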
Layer Normalization?
Weight Normalization
: initializing with randn() (standard normal distribution) helps avoid extreme random values (e.g. with sigmoid, values concentrated near the center speed up convergence)
Data Normalization and Standardization
: common preprocessing (zero mean, unit variance) that helps avoid vanishing or exploding gradients and makes the iterative optimization faster (sketch below)
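A minimal sketch of input standardization; applying the training-set mean and standard deviation to the test set is an assumption consistent with common practice.

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    """Subtract the training mean and divide by the training std,
    applying the same statistics to the test set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + eps
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# Made-up feature scale just for illustration
X_train = np.random.rand(1000, 20) * 50.0
X_test = np.random.rand(200, 20) * 50.0
X_train_n, X_test_n = standardize(X_train, X_test)
```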
Optimizers
SGD and mini-batch GD
: the baseline algorithms. With mini-batches, shuffle the data at every epoch
Momentum SGD
: faster learning by using an exponentially weighted average of the previous gradients in the update step (sketch below)
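A minimal sketch of the momentum update for one parameter tensor; the learning rate and beta are the usual illustrative defaults.

```python
import numpy as np

def momentum_update(w, dw, v, lr=0.01, beta=0.9):
    """Keep an exponentially weighted average v of past gradients
    and step in that averaged direction."""
    v = beta * v + (1 - beta) * dw
    w = w - lr * v
    return w, v

w = np.random.randn(3, 3)
v = np.zeros_like(w)
dw = np.random.randn(3, 3)   # placeholder gradient from backprop
w, v = momentum_update(w, dw, v)
```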
Adam
: combines Momentum and RMSprop into one algorithm. Its hyperparameters beta1 and beta2 are usually not tuned; the default values (0.9 and 0.999) are kept
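A minimal sketch of an Adam-style update combining the two averages with bias correction; the hyperparameter defaults shown are the standard beta1=0.9, beta2=0.999.

```python
import numpy as np

def adam_update(w, dw, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Momentum-style average v of gradients plus RMSprop-style average s
    of squared gradients, both bias-corrected, in a single update."""
    v = beta1 * v + (1 - beta1) * dw
    s = beta2 * s + (1 - beta2) * dw ** 2
    v_hat = v / (1 - beta1 ** t)     # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w = np.random.randn(3, 3)
v, s = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 4):                # t starts at 1 for bias correction
    dw = np.random.randn(3, 3)       # placeholder gradient
    w, v, s = adam_update(w, dw, v, s, t)
```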
RMSprop
: faster learning by dividing the update by the root of an exponentially weighted average of the squared gradients, which scales the step per parameter and damps oscillations in directions with large gradients
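A minimal sketch of the RMSprop update; the eps term avoiding division by zero and the default hyperparameters are assumptions.

```python
import numpy as np

def rmsprop_update(w, dw, s, lr=0.001, beta=0.9, eps=1e-8):
    """Divide the gradient by the root of an exponentially weighted
    average of its squares, shrinking steps along noisy directions."""
    s = beta * s + (1 - beta) * dw ** 2
    w = w - lr * dw / (np.sqrt(s) + eps)
    return w, s

w = np.random.randn(3, 3)
s = np.zeros_like(w)
dw = np.random.randn(3, 3)   # placeholder gradient
w, s = rmsprop_update(w, dw, s)
```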
Exponential Weighted Average
: the underlying concept behind Momentum, RMSprop and Adam
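A minimal sketch of the exponentially weighted average itself, with the optional bias correction used during the first steps.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """v_t = beta * v_{t-1} + (1 - beta) * x_t, roughly averaging
    the last 1 / (1 - beta) values."""
    v, out = 0.0, []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

print(ewa([10.0, 12.0, 11.0, 13.0]))
```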
Special Topics
Reparametrization Trick.
Review
Bias - Variance (ML Recipe)
High Bias?
More Complex Models
Train Longer
Advanced Optimization
Feature Engineering
High Variance?
Early Stopping
Regularization: L1, L2, Dropout
Reduce Complexity of Model
More data
Forward - Backward Propagation
Computational Graph and Chain Rule
Vectorization and Block Structure
Gradient Descent
Gradient Checking with Numerical Approximation
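A minimal sketch of gradient checking with the centered difference; the quadratic test function is only an example stand-in for a network's loss.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-7):
    """Centered difference (f(x+eps) - f(x-eps)) / (2*eps),
    computed one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

def relative_error(analytic, numeric):
    return np.linalg.norm(analytic - numeric) / (
        np.linalg.norm(analytic) + np.linalg.norm(numeric))

# Example: f(x) = sum(x^2), whose analytic gradient is 2x
x = np.random.randn(5)
num = numerical_gradient(lambda v: np.sum(v ** 2), x)
ana = 2 * x
print(relative_error(ana, num))    # should be around 1e-7 or smaller
```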
Cross Validation
Train/Val/Test
K-fold
Logistic Loss
Intuition and Maximum Likelihood
Softmax Generalization for k classes.
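A minimal sketch of the binary logistic loss and its softmax generalization to k classes; the one-hot label encoding and clipping for numerical stability are assumptions.

```python
import numpy as np

def logistic_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy: -[y*log(y_hat) + (1-y)*log(1-y_hat)], averaged."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def softmax_cross_entropy(Y, logits):
    """Softmax over k classes followed by cross-entropy with one-hot labels Y."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1))

# Illustrative usage
y = np.array([1, 0, 1]); y_hat = np.array([0.9, 0.2, 0.7])
print(logistic_loss(y, y_hat))
Y = np.eye(4)[[0, 2, 3]]; logits = np.random.randn(3, 4)
print(softmax_cross_entropy(Y, logits))
```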
Transfer Learning
Task A and B have the same input X
You have a lot more data for task A than task B
Low-level features from A could be helpful for learning B
Multi-task Learning
(Less common)
Multi-label logistic loss: unlike softmax regression, one image can have multiple labels
The loss still works if the data is only partially labeled: sum only over the entries that actually have a label (see the sketch after this list)
Training on a set of tasks that could benefit from having shared lower-level features
Usually the amount of data you have for each task is quite similar; you can even focus on one task
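A minimal sketch of a multi-label logistic loss with a mask for partially labeled data; the convention that -1 marks an unlabeled entry is an assumption made for the example.

```python
import numpy as np

def multilabel_loss(Y, probs, eps=1e-12):
    """Independent sigmoid cross-entropy per label; entries marked -1
    (unlabeled) are masked out and do not contribute to the loss."""
    mask = (Y != -1)
    Yc = np.where(mask, Y, 0)
    probs = np.clip(probs, eps, 1 - eps)
    per_entry = -(Yc * np.log(probs) + (1 - Yc) * np.log(1 - probs))
    return (per_entry * mask).sum() / mask.sum()

# One image can have several labels; -1 means "not labeled for this task"
Y = np.array([[1, 0, -1],
              [1, 1, 0]])
probs = np.array([[0.8, 0.1, 0.5],
                  [0.9, 0.7, 0.2]])
print(multilabel_loss(Y, probs))
```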
Entropy, Cross Entropy and KL Divergence
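A minimal sketch relating the three quantities for discrete distributions, using KL(p||q) = H(p, q) - H(p); the example distributions are made up.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

def cross_entropy(p, q, eps=1e-12):
    return -np.sum(p * np.log(q + eps))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = cross_entropy(p, q) - entropy(p) >= 0."""
    return cross_entropy(p, q, eps) - entropy(p, eps)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))
```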
Non-Linear Activations
Sigmoid
- Output Layer
Tanh (hyperbolic tangent)
- Activation mean close to zero (centering activations)
ReLU
- Faster Training
Leaky ReLU
- Faster training and allows small negative values (reduces dead units)
Weight Initialization
Vanishing and Exploding Gradients
Intuition about weight initialization
He Initialization (ReLU)
Xavier Initialization (tanh)
Glorot Initialization (another name for Xavier initialization)
Symmetry Break - Random Initialization
randn() (normal distribution) helps avoid extreme random values (e.g. with sigmoid, values concentrated near the center speed up convergence); see the sketch below
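A minimal sketch of He and Xavier/Glorot initialization for a fully connected layer; scaling by fan-in and the layer sizes shown are assumptions for the example.

```python
import numpy as np

def he_init(fan_in, fan_out):
    """He initialization for ReLU layers: variance 2 / fan_in."""
    return np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot initialization for tanh layers: variance 1 / fan_in."""
    return np.random.randn(fan_out, fan_in) * np.sqrt(1.0 / fan_in)

# Random (non-identical) weights also break the symmetry between units
W1 = he_init(fan_in=784, fan_out=128)
W2 = xavier_init(fan_in=128, fan_out=64)
b1, b2 = np.zeros((128, 1)), np.zeros((64, 1))   # biases can start at zero
```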