Deep Learning
1. Intro
Computer Vision mimics the Human Visual System.
It is a central building block of robotic intelligence.
First defined in the 1960s within AI research groups.
Why DL in CV is possible:
- Big Data – there is enough data to learn from
- Powerful hardware (GPUs) – models are trainable
- Deep – models are complex enough
2. Linear Regression
Machine learning
Experience = Data
Performance Measure = a function that quantifies how well the task is solved, e.g. accuracy
Epoch = complete pass through all the data
Unsupervised learning:
– no labels or target class
– find properties of the structure in the data
– clustering (k-means), dimensionality reduction (PCA)
Supervised learning:
– labels or target classes
Reinforcement learning:
– agents interact with an environment
– environment rewards agents
Linear Regression
– supervised learning method
– finds a linear model that explains a target Y given the inputs X
– prediction: y_i = sum_{j=1}^{d} x_ij * theta_j
– x_i1 = 1, so that theta_1 acts as the bias
– loss function: L(theta) = 1/N * sum_{i=1}^{N} (y_i - Y_i)^2, the Mean Squared Error (MSE), where Y_i is the ground-truth target
– optimization: set dL(theta)/dtheta = 0
– closed-form solution: theta = (X^T * X)^-1 * X^T * Y
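A minimal numpy sketch of the closed-form solution above; the synthetic data is made up for illustration, and np.linalg.solve is used instead of explicitly inverting X^T * X:

    import numpy as np

    # Synthetic data: N samples, a constant column (x_i1 = 1) plus d-1 real features.
    rng = np.random.default_rng(0)
    N, d = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])
    theta_true = np.array([0.5, 2.0, -1.0])
    Y = X @ theta_true + 0.1 * rng.normal(size=N)   # noisy ground-truth targets

    # Closed form: theta = (X^T X)^-1 X^T Y, computed as a linear solve.
    theta = np.linalg.solve(X.T @ X, X.T @ Y)

    mse = np.mean((X @ theta - Y) ** 2)             # Mean Squared Error on the training data
    print(theta, mse)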
kNN
Hyperparameters:
– Distance function (L1, L2)
– k (number of neighbors)
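A minimal sketch of a kNN classifier with exactly these two hyperparameters; the toy data and the majority-vote rule are illustrative assumptions:

    import numpy as np

    def knn_predict(X_train, y_train, x_query, k=3, distance="L2"):
        # Hyperparameter 1: distance function (L1 = Manhattan, L2 = Euclidean).
        diff = X_train - x_query
        if distance == "L1":
            dists = np.abs(diff).sum(axis=1)
        else:
            dists = np.sqrt((diff ** 2).sum(axis=1))
        # Hyperparameter 2: k, the number of neighbors used in the majority vote.
        nearest = np.argsort(dists)[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]

    X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3, distance="L1"))  # -> 1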
Cross validation
- Split the training data into N folds
- Train/validate N times, each time holding out a different fold for validation
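A minimal N-fold cross-validation sketch; train_and_eval is a hypothetical callback standing in for whatever model training and validation metric you plug in:

    import numpy as np

    def cross_validate(X, Y, train_and_eval, n_folds=5, seed=0):
        # Split the training data into n_folds folds (after shuffling).
        idx = np.random.default_rng(seed).permutation(len(X))
        folds = np.array_split(idx, n_folds)
        scores = []
        for i in range(n_folds):
            # Each round, one fold is held out for validation, the rest is used for training.
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            scores.append(train_and_eval(X[train_idx], Y[train_idx], X[val_idx], Y[val_idx]))
        return float(np.mean(scores))   # average validation score over the folds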
Maximum Likelihood (ML) Estimate
– answers why MSE is the right loss for linear regression: assuming Gaussian noise on the targets, maximizing the likelihood is equivalent to minimizing the MSE (short derivation below)
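A short sketch of that argument, assuming i.i.d. Gaussian noise with fixed variance sigma^2 on the targets (notation as in the linear regression bullets above):

    p(Y_i \mid x_i, \theta) = \mathcal{N}(Y_i;\, y_i, \sigma^2), \qquad y_i = \sum_{j=1}^{d} x_{ij}\,\theta_j
    \log p(Y \mid X, \theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - Y_i)^2 + \text{const}
    \Rightarrow\; \arg\max_\theta \log p(Y \mid X, \theta) \;=\; \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} (y_i - Y_i)^2 \;=\; \arg\min_\theta \text{MSE}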
Regularization
– L(theta) = data loss + lambda * R(theta), where R(theta) is a regularization term, e.g. L2: R(theta) = theta^T * theta
– example: theta_1 = [1.5, 0, 0] vs. theta_2 = [0.25, 0.5, 0.25]
– theta_2 is "better" because it uses information from all features, while theta_1 ignores the 2nd and 3rd; the L2 penalty also prefers it (0.25^2 + 0.5^2 + 0.25^2 = 0.375 < 1.5^2 = 2.25), see the sketch below
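A minimal sketch of linear regression with the L2 regularizer added to the MSE loss (ridge regression); the closed form below follows from setting the gradient of the combined loss to zero, and for brevity it also regularizes the bias column, which a real implementation might exclude:

    import numpy as np

    def ridge_fit(X, Y, lam=1.0):
        # Minimizes 1/N * ||X theta - Y||^2 + lam * theta^T theta.
        # Setting the gradient to zero gives: (X^T X + N * lam * I) theta = X^T Y
        N, d = X.shape
        return np.linalg.solve(X.T @ X + N * lam * np.eye(d), X.T @ Y)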
Overfitting
– train error low, validation error high
– model is too complex for the data
– too low regularization coefficient
Underfitting
– train and validation errors are high
– model is too simple for the data
– too high regularization parameter
Training Process
3. Optimization
Maximum a Posteriori (MAP) Estimate
– like the ML estimate, but with a prior over theta; a Gaussian prior on theta corresponds to L2 regularization
Gradient Descent
– steps in the direction of a negative gradient
x' = x - e * grad_x f(x); e is the learning rate
– in general, not guaranteed to reach the optimum
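A minimal gradient-descent sketch for the linear-regression MSE loss from section 2; the learning rate and step count are arbitrary illustrative values:

    import numpy as np

    def gradient_descent(X, Y, lr=0.1, steps=1000):
        # Gradient of L(theta) = 1/N * ||X theta - Y||^2 is 2/N * X^T (X theta - Y);
        # each step moves in the direction of the negative gradient.
        theta = np.zeros(X.shape[1])
        for _ in range(steps):
            grad = 2.0 / len(X) * X.T @ (X @ theta - Y)
            theta = theta - lr * grad     # x' = x - e * grad_x f(x)
        return theta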
Numerical gradient
– slow, approximate, but easy-to-write
Analytical gradient
– fast, exact, but error-prone
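The two are usually combined in a gradient check: compute the slow numerical gradient and compare it to the fast analytical one. A minimal sketch on a made-up function:

    import numpy as np

    def numerical_gradient(f, x, h=1e-5):
        # Slow and approximate: central differences, two function evaluations per dimension.
        grad = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = h
            grad[i] = (f(x + e) - f(x - e)) / (2 * h)
        return grad

    f = lambda x: np.sum(x ** 2)     # example function
    analytical = lambda x: 2 * x     # its hand-derived (analytical) gradient
    x = np.array([1.0, -2.0, 3.0])
    print(np.max(np.abs(numerical_gradient(f, x) - analytical(x))))   # close to 0 if correct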
Classic optimizers (update rules sketched after this list)
Momentum update
– designed to accelerate training by accumulating gradients
– introduces more hyperparameters
Nesterov’s momentum
– look-ahead momentum
– introduces more hyperparameters
Adaptive learning rates
SGD
AdaGrad
Adam
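A minimal sketch of the momentum and Adam update rules named above; the hyperparameter values are common defaults, not values prescribed by the lecture:

    import numpy as np

    def momentum_step(theta, grad, v, lr=0.01, mu=0.9):
        # Momentum: accumulate past gradients in a velocity term v.
        v = mu * v - lr * grad
        return theta + v, v

    def adam_step(theta, grad, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Adam: momentum on the gradient (m) plus a per-parameter adaptive
        # learning rate from a running average of squared gradients (s).
        m = beta1 * m + (1 - beta1) * grad
        s = beta2 * s + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)     # bias correction; t starts at 1
        s_hat = s / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
        return theta, m, s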
Importance of the learning rate
Newton’s method
– gets rid of the learning rate
– approximates the loss function by a second-order Taylor series expansion
– update step: x' = x - H^-1 * grad_x f(x), where H is the Hessian
– exploits the curvature to take a more direct route
– useful only if you can apply it to all the data at once (computing and inverting the Hessian is expensive for large models)
BFGS and L-BFGS
– belong to the family of quasi-Newton methods
– use approximations of the inverse of the Hessian
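As a usage note (my addition, not from the lecture): SciPy ships an L-BFGS implementation; a minimal sketch on a toy quadratic:

    import numpy as np
    from scipy.optimize import minimize

    f = lambda x: np.sum((x - 3.0) ** 2)   # toy loss with minimum at x = 3
    grad = lambda x: 2.0 * (x - 3.0)       # analytical gradient

    # L-BFGS builds an approximation of the inverse Hessian from recent gradients.
    res = minimize(f, x0=np.zeros(5), jac=grad, method="L-BFGS-B")
    print(res.x)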
Backprop
– all optimization schemes are based on computing gradients
– backprop breaks the gradient computation of a complex loss function into simple local steps, combined via the chain rule
The flow of the gradients
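A minimal sketch of how gradients flow backwards through a tiny graph, f(x, w) = sigmoid(w * x); the function is made up just to show the chain rule applied node by node:

    import numpy as np

    # Forward pass, broken into simple nodes.
    x, w = 2.0, -0.5
    a = w * x                       # multiply node
    f = 1.0 / (1.0 + np.exp(-a))    # sigmoid node

    # Backward pass: each node multiplies the incoming gradient by its local gradient.
    df_da = f * (1.0 - f)           # local gradient of the sigmoid
    df_dw = df_da * x               # chain rule: df/dw = df/da * da/dw, with da/dw = x
    df_dx = df_da * w               # chain rule: df/dx = df/da * da/dx, with da/dx = w
    print(df_dw, df_dx)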
4. Intro to NN
5. Training NN
6. CNN
7. RNN
Project
Adapted WaveNet
Input: raw audio, wav
Output: raw audio, wav
Preprocessing:
Downsampling to 16 kHz, building the dataset with a simple additive mixing model over DCD100, silence removal
Loss function:
L1
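A minimal preprocessing sketch under my own assumptions (librosa/soundfile as tools, hypothetical stem file names, a guessed silence threshold); the project's actual pipeline may differ:

    import librosa
    import numpy as np
    import soundfile as sf

    # Hypothetical DCD100 stems; load and downsample to 16 kHz mono.
    vocals, _ = librosa.load("vocals.wav", sr=16000, mono=True)
    accomp, _ = librosa.load("accompaniment.wav", sr=16000, mono=True)

    # Simple additive mixing model: the network input is the sum of the stems.
    n = min(len(vocals), len(accomp))
    mixture = vocals[:n] + accomp[:n]

    # Remove leading/trailing silence (the threshold is a guess, not the project's value).
    mixture, _ = librosa.effects.trim(mixture, top_db=30)

    sf.write("mixture_16k.wav", mixture, 16000)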