An overview of gradient descent optimization algorithms
Gradient descent variants
Batch gradient descent
Guaranteed to converge to a local minimum for non-convex error surfaces
Guaranteed to converge to the global minimum for convex error surfaces
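A minimal sketch of batch gradient descent on an illustrative synthetic least-squares problem (the data, learning rate, and step count are assumptions for this toy example, not from the article):

```python
import numpy as np

# Synthetic least-squares problem: minimize the mean squared error of X @ theta vs. y.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

theta, lr = np.zeros(5), 0.1

for epoch in range(100):
    # One update per pass: the gradient is computed over the *entire* dataset.
    grad = X.T @ (X @ theta - y) / len(y)
    theta -= lr * grad
```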
Stochastic gradient descent
Noisy, fluctuating objective (loss) function due to high-variance per-example updates
Almost surely converges to a local minimum (non-convex case) or the global minimum (convex case) when the learning rate is decreased slowly
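A sketch of the stochastic variant on the same illustrative least-squares setup (learning rate and epoch count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

theta, lr = np.zeros(5), 0.01

for epoch in range(10):
    for i in rng.permutation(len(y)):
        # One update per training example: cheap and frequent, but the
        # high-variance gradient makes the loss trace fluctuate.
        grad_i = (X[i] @ theta - y[i]) * X[i]
        theta -= lr * grad_i
```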
Mini-batch gradient descent
Most common variant
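A mini-batch sketch under the same illustrative setup (the batch size of 64 and the other hyperparameters are assumptions chosen for the toy example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

theta, lr, batch_size = np.zeros(5), 0.05, 64

for epoch in range(20):
    order = rng.permutation(len(y))          # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        # Gradient averaged over a small batch: lower variance than pure SGD,
        # far cheaper per update than the full dataset.
        grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
        theta -= lr * grad
```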
Challenges of SGD
Choosing a proper learning rate
LR schedules cannot adapt to dataset characteristics
All parameters are updated with the same LR
Getting trapped in suboptimal local minima and saddle points
SGD optimization algorithms
Momentum
Helps to make faster progress in ravines (areas where the surface curves much more steeply in one dimension than in another)
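A minimal sketch of the momentum update on an assumed toy quadratic "ravine" (the objective and step size are illustrative; gamma = 0.9 is the commonly used default):

```python
import numpy as np

def grad(w):
    # Gradient of a toy "ravine": f(w) = w[0]**2 + 10 * w[1]**2
    return np.array([2.0 * w[0], 20.0 * w[1]])

w = np.array([5.0, 5.0])
v = np.zeros_like(w)
lr, gamma = 0.01, 0.9      # gamma: momentum term, usually set around 0.9

for _ in range(200):
    v = gamma * v + lr * grad(w)   # decaying accumulation of past gradients:
    w = w - v                      # consistent directions build up speed,
                                   # oscillating ones partially cancel
```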
Nesterov accelerated gradient
Gradient is calculated w.r.t. the approximate future position of parameters
Works very well for RNNs
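A sketch of NAG on the same assumed toy quadratic; compared with plain momentum, only the point at which the gradient is evaluated changes:

```python
import numpy as np

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # same toy "ravine" gradient

w = np.array([5.0, 5.0])
v = np.zeros_like(w)
lr, gamma = 0.01, 0.9

for _ in range(200):
    # Evaluate the gradient at the approximate *future* position w - gamma * v
    # rather than at the current w: NAG's look-ahead correction.
    v = gamma * v + lr * grad(w - gamma * v)
    w = w - v
```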
Adagrad
Well suited for dealing with sparse data
Performs larger updates for infrequent features and smaller updates for frequent ones
No need to manually tune the LR
Weakness: squared gradients accumulate in the denominator, so the effective learning rate keeps shrinking and eventually becomes vanishingly small
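A minimal Adagrad sketch on the assumed toy quadratic (the learning rate is an illustrative choice); the ever-growing accumulator G is exactly the weakness noted above:

```python
import numpy as np

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # toy quadratic gradient

w = np.array([5.0, 5.0])
G = np.zeros_like(w)        # per-parameter sum of all past squared gradients
lr, eps = 0.1, 1e-8

for _ in range(500):
    g = grad(w)
    G += g ** 2                            # accumulates without bound ...
    w -= lr / (np.sqrt(G) + eps) * g       # ... so each parameter's effective step only shrinks
```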
Adadelta
Extension of Adagrad
Only a window of squared gradients is used
No need to set a default learning rate
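A minimal Adadelta sketch on the same assumed toy quadratic (rho and epsilon follow common defaults); note that no learning rate appears in the update:

```python
import numpy as np

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # toy quadratic gradient

w = np.array([5.0, 5.0])
Eg2 = np.zeros_like(w)      # decaying average of squared gradients (the "window")
Edx2 = np.zeros_like(w)     # decaying average of squared parameter updates
rho, eps = 0.95, 1e-6

for _ in range(2000):
    g = grad(w)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    # The step is scaled by the RMS of previous updates, so no global
    # learning rate has to be chosen.
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    w = w + dx
```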
RMSprop
Proposed by Geoff Hinton in Lecture 6e of his Coursera class (unpublished)
Similar to Adadelta
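A minimal RMSprop sketch on the assumed toy quadratic (Hinton's suggested defaults are gamma = 0.9 and a smaller learning rate; the rate here is enlarged so the toy converges quickly):

```python
import numpy as np

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # toy quadratic gradient

w = np.array([5.0, 5.0])
Eg2 = np.zeros_like(w)      # decaying average of squared gradients
lr, gamma, eps = 0.01, 0.9, 1e-8

for _ in range(600):
    g = grad(w)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    # Divide the learning rate by an RMS of recent gradients (the same idea as
    # Adadelta's first update rule).
    w -= lr / (np.sqrt(Eg2) + eps) * g
```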
Adam
Empirically performs better than other adaptive learning-rate methods
Behaves like a heavy ball with friction, preferring flat minima in the error surface
RMSprop + Momentum
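A minimal Adam sketch on the assumed toy quadratic; beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the paper's defaults, while the step size is enlarged for the toy:

```python
import numpy as np

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # toy quadratic gradient

w = np.array([5.0, 5.0])
m = np.zeros_like(w)        # first moment: decaying mean of gradients (momentum-like)
v = np.zeros_like(w)        # second moment: decaying mean of squared gradients (RMSprop-like)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)    # bias correction: both moments start at zero
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```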
AdaMax
A variant of Adam that scales updates by the infinity norm of past gradients instead of the L2 norm
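A minimal AdaMax sketch on the same assumed toy quadratic, replacing Adam's squared-gradient average with an exponentially weighted infinity norm:

```python
import numpy as np

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # toy quadratic gradient

w = np.array([5.0, 5.0])
m = np.zeros_like(w)
u = np.zeros_like(w)        # infinity-norm-based second moment (replaces Adam's v)
lr, beta1, beta2 = 0.1, 0.9, 0.999

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))     # exponentially weighted infinity norm;
    w -= (lr / (1 - beta1 ** t)) * m / u     # the max-based norm needs no bias correction
```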
Nadam
Adam + NAG
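A minimal Nadam sketch on the assumed toy quadratic, combining Adam's moment estimates with a Nesterov-style look-ahead term (hyperparameters as in the Adam sketch above):

```python
import numpy as np

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # toy quadratic gradient

w = np.array([5.0, 5.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Nesterov-style look-ahead: blend the bias-corrected momentum with the
    # current gradient before taking the step.
    w -= lr / (np.sqrt(v_hat) + eps) * (beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t))
```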
AMSGrad
Fixes the short-term memory of the exponentially decaying average of squared gradients used by Adam and related methods by keeping the maximum of past squared gradients
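A minimal AMSGrad sketch on the assumed toy quadratic; the only change from Adam shown here is the running maximum of the second-moment estimate (bias correction omitted, as in the original formulation):

```python
import numpy as np

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # toy quadratic gradient

w = np.array([5.0, 5.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
v_max = np.zeros_like(w)    # running maximum of v: the long-memory fix
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for _ in range(500):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_max = np.maximum(v_max, v)            # the denominator can never shrink,
    w -= lr * m / (np.sqrt(v_max) + eps)    # so large past gradients are not forgotten
```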
Parallelizing and distributing SGD
Hogwild!
Downpour SGD
Delay-tolerant algorithms for SGD
Elastic averaging SGD
Additional strategies for optimizing SGD
Shuffling and Curriculum Learning
Shuffle data after each epoch
Batch normalization
Has a regularizing effect
Early stopping
Gradient noise
Helps to train very deep and complex models
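A minimal sketch of annealed gradient noise on the assumed toy quadratic, using a decaying-variance schedule of the form eta / (1 + t)**gamma; the specific eta value is an illustrative choice:

```python
import numpy as np

def grad(w):
    return np.array([2.0 * w[0], 20.0 * w[1]])   # toy quadratic gradient

rng = np.random.default_rng(0)
w = np.array([5.0, 5.0])
lr, eta, gamma = 0.01, 0.3, 0.55    # annealing schedule: variance = eta / (1 + t)**gamma

for t in range(500):
    sigma = np.sqrt(eta / (1 + t) ** gamma)      # noise standard deviation decays over time
    g = grad(w) + rng.normal(scale=sigma, size=w.shape)
    w -= lr * g
```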