Deep Learning
1. Intro
Computer Vision mimics the Human Visual System.
It is a central building block of robotic intelligence.
First defined in the 1960s within AI research groups.
Why DL in CV is possible:
- Big Data – there is enough data to learn from
- Powerful hardware (GPUs) – models are trainable
- Deep – models are complex enough
2. Linear Regression
Machine learning
Experience = Data
Performance Measure = a function that quantifies how well the task is solved, e.g. accuracy
Epoch = complete pass through all the data
Unsupervised learning:
– no labels or target class
– find properties of the structure in the data
– clustering (k-means), dimensionality reduction (PCA)
Supervised learning:
– labels or target classes
Reinforcement learning:
– agents interact with an environment
– environment rewards agents
Linear Regression
– supervised learning method
– finds a linear model that explains a target Y given the inputs X
– prediction: y_i = sum_{j=1}^{d} x_ij * theta_j
– x_i1 = 1, so that theta_1 acts as the bias
– loss function: L(theta) = 1/N * sum_{i=1}^{N} (y_i - Y_i)^2, the Mean Squared Error (MSE), where Y_i is the ground-truth target
– optimization: set dL(theta)/dtheta = 0
– closed-form solution: theta = (X^T * X)^-1 * X^T * Y
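A minimal numpy sketch of the closed-form solution above; the synthetic data is made up for illustration, and np.linalg.solve is used instead of explicitly inverting X^T * X:

    import numpy as np

    # Synthetic data: N samples, a constant column (x_i1 = 1) plus d-1 real features.
    rng = np.random.default_rng(0)
    N, d = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])
    theta_true = np.array([0.5, 2.0, -1.0])
    Y = X @ theta_true + 0.1 * rng.normal(size=N)   # noisy ground-truth targets

    # Closed form: theta = (X^T X)^-1 X^T Y, computed as a linear solve.
    theta = np.linalg.solve(X.T @ X, X.T @ Y)

    mse = np.mean((X @ theta - Y) ** 2)             # Mean Squared Error on the training data
    print(theta, mse)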
kNN
Hyperparameters:
– Distance function (L1, L2)
– k (number of neighbors)
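A minimal sketch of a kNN classifier with exactly these two hyperparameters; the toy data and the majority-vote rule are illustrative assumptions:

    import numpy as np

    def knn_predict(X_train, y_train, x_query, k=3, distance="L2"):
        # Hyperparameter 1: distance function (L1 = Manhattan, L2 = Euclidean).
        diff = X_train - x_query
        if distance == "L1":
            dists = np.abs(diff).sum(axis=1)
        else:
            dists = np.sqrt((diff ** 2).sum(axis=1))
        # Hyperparameter 2: k, the number of neighbors used in the majority vote.
        nearest = np.argsort(dists)[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]

    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3, distance="L1"))  # -> 1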
Cross validation
- Split the training data into N folds
- Train/validate N times, each time holding out a different fold for validation
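A minimal N-fold cross-validation sketch; train_and_eval is a hypothetical callback standing in for whatever model training and validation metric you plug in:

    import numpy as np

    def cross_validate(X, Y, train_and_eval, n_folds=5, seed=0):
        # Split the training data into n_folds folds (after shuffling).
        idx = np.random.default_rng(seed).permutation(len(X))
        folds = np.array_split(idx, n_folds)
        scores = []
        for i in range(n_folds):
            # Each round, one fold is held out for validation, the rest is used for training.
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            scores.append(train_and_eval(X[train_idx], Y[train_idx], X[val_idx], Y[val_idx]))
        return float(np.mean(scores))   # average validation score over the folds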
Maximum Likelihood (ML) Estimate
– answers why MSE is the right loss for linear regression: assuming Gaussian noise on the targets, maximizing the likelihood is equivalent to minimizing the MSE (short derivation below)
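A short sketch of that argument, assuming i.i.d. Gaussian noise with fixed variance sigma^2 on the targets (notation as in the linear regression bullets above):

    p(Y_i \mid x_i, \theta) = \mathcal{N}(Y_i;\, y_i, \sigma^2), \qquad y_i = \sum_{j=1}^{d} x_{ij}\,\theta_j
    \log p(Y \mid X, \theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - Y_i)^2 + \text{const}
    \Rightarrow\; \arg\max_\theta \log p(Y \mid X, \theta) \;=\; \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} (y_i - Y_i)^2 \;=\; \arg\min_\theta \text{MSE}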
Regularization
– L(theta) = data loss + lambda * R(theta), where R(theta) is a regularization term, e.g. L2: R(theta) = theta^T * theta
– example: theta_1 = [1.5, 0, 0] vs. theta_2 = [0.25, 0.5, 0.25]
– theta_2 is "better" because it uses information from all features, while theta_1 ignores the 2nd and 3rd; the L2 penalty also prefers it (0.25^2 + 0.5^2 + 0.25^2 = 0.375 < 1.5^2 = 2.25), see the sketch below
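A minimal sketch of linear regression with the L2 regularizer added to the MSE loss (ridge regression); the closed form below follows from setting the gradient of the combined loss to zero, and for brevity it also regularizes the bias column, which a real implementation might exclude:

    import numpy as np

    def ridge_fit(X, Y, lam=1.0):
        # Minimizes 1/N * ||X theta - Y||^2 + lam * theta^T theta.
        # Setting the gradient to zero gives: (X^T X + N * lam * I) theta = X^T Y
        N, d = X.shape
        return np.linalg.solve(X.T @ X + N * lam * np.eye(d), X.T @ Y)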
Overfitting
– train error low, validation error high
– model is too complex for the data
– too low regularization coefficient
Underfitting
– train and validation errors are high
– model is too simple for the data
– too high regularization parameter
Training Process
3. Optimization
Maximum a Posteriori (MAP) Estimate
– like the ML estimate, but with a prior over theta; a Gaussian prior on theta corresponds to L2 regularization
Gradient Descent
– steps in the direction of a negative gradient
x' = x - e * grad_x f(x); e is the learning rate
– in general, not guaranteed to reach the optimum
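A minimal gradient-descent sketch for the linear-regression MSE loss from section 2; the learning rate and step count are arbitrary illustrative values:

    import numpy as np

    def gradient_descent(X, Y, lr=0.1, steps=1000):
        # Gradient of L(theta) = 1/N * ||X theta - Y||^2 is 2/N * X^T (X theta - Y);
        # each step moves in the direction of the negative gradient.
        theta = np.zeros(X.shape[1])
        for _ in range(steps):
            grad = 2.0 / len(X) * X.T @ (X @ theta - Y)
            theta = theta - lr * grad     # x' = x - e * grad_x f(x)
        return theta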
Numerical gradient
– slow, approximate, but easy-to-write
Analytical gradient
– fast, exact, but error-prone
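The two are usually combined in a gradient check: compute the slow numerical gradient and compare it to the fast analytical one. A minimal sketch on a made-up function:

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        # Slow and approximate: central differences, two function evaluations per dimension.
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = h
            grad[i] = (f(x + e) - f(x - e)) / (2 * h)
        return grad

    f = lambda x: np.sum(x ** 2)     # example function
    analytical = lambda x: 2 * x     # its hand-derived (analytical) gradient
    x = np.array([1.0, -2.0, 3.0])
    print(np.max(np.abs(numerical_gradient(f, x) - analytical(x))))   # close to 0 if correct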
Classic optimizers (update rules sketched after this list)
Momentum update
– designed to accelerate training by accumulating gradients
– introduces more hyperparameters
Nesterov’s momentum
– look-ahead momentum
– introduces more hyperparameters
Adaptive learning rates
SGD
AdaGrad
Adam
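A minimal sketch of the momentum and Adam update rules named above; the hyperparameter values are common defaults, not values prescribed by the lecture:

    import numpy as np

    def momentum_step(theta, grad, v, lr=0.01, mu=0.9):
        # Momentum: accumulate past gradients in a velocity term v.
        v = mu * v - lr * grad
        return theta + v, v

    def adam_step(theta, grad, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Adam: momentum on the gradient (m) plus a per-parameter adaptive
        # learning rate from a running average of squared gradients (s).
        m = beta1 * m + (1 - beta1) * grad
        s = beta2 * s + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)     # bias correction; t starts at 1
        s_hat = s / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
        return theta, m, s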
Importance of the learning rate
Newton’s method
– gets rid of the learning rate
– approximates the loss function by a second-order Taylor series expansion
– update step: x' = x - H^-1 * grad_x f(x), where H is the Hessian
– exploits the curvature to take a more direct route
– useful only if you can apply it to all the data at once (computing and inverting the Hessian is expensive for large models)
BFGS and L-BFGS
– belong to the family of quasi-Newton methods
– use approximations of the inverse of the Hessian
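As a usage note (my addition, not from the lecture): SciPy ships an L-BFGS implementation; a minimal sketch on a toy quadratic:

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: np.sum((x - 3.0) ** 2)   # toy loss with minimum at x = 3
    grad = lambda x: 2.0 * (x - 3.0)       # analytical gradient

    # L-BFGS builds an approximation of the inverse Hessian from recent gradients.
    res = minimize(f, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
    print(res.x)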
Backprop
– all optimization schemes are based on computing gradients
– backprop breaks the gradient computation of a complex loss function into simple local steps, combined via the chain rule
The flow of the gradients
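A minimal sketch of how gradients flow backwards through a tiny graph, f(x, w) = sigmoid(w * x); the function is made up just to show the chain rule applied node by node:

    import numpy as np

    # Forward pass, broken into simple nodes.
    x, w = 2.0, -0.5
    a = w * x                       # multiply node
    f = 1.0 / (1.0 + np.exp(-a))    # sigmoid node

    # Backward pass: each node multiplies the incoming gradient by its local gradient.
    df_da = f * (1.0 - f)           # local gradient of the sigmoid
    df_dw = df_da * x               # chain rule: df/dw = df/da * da/dw, with da/dw = x
    df_dx = df_da * w               # chain rule: df/dx = df/da * da/dx, with da/dx = w
    print(df_dw, df_dx)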
4. Intro to NN
5. Training NN
6. CNN
7. RNN
Project
Adapted WaveNet
Input: raw audio, wav
Output: raw audio, wav
Preprocessing:
Downsampling to 16 kHz, building the dataset with a simple additive mixing model over DCD100, silence removal
Loss function:
L1
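A minimal preprocessing sketch under my own assumptions (librosa/soundfile as tools, hypothetical stem file names, a guessed silence threshold); the project's actual pipeline may differ:

    import librosa
    import numpy as np
    import soundfile as sf

    # Hypothetical DCD100 stems; load and downsample to 16 kHz mono.
    vocals, _ = librosa.load("vocals.wav", sr=16000, mono=True)
    accomp, _ = librosa.load("accompaniment.wav", sr=16000, mono=True)

    # Simple additive mixing model: the network input is the sum of the stems.
    n = min(len(vocals), len(accomp))
    mixture = vocals[:n] + accomp[:n]

    # Remove leading/trailing silence (the threshold is a guess, not the project's value).
    mixture, _ = librosa.effects.trim(mixture, top_db=30)

    sf.write("mixture_16k.wav", mixture, 16000)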