SGD
Plain Stochastic Gradient Descent:
Move each weight against the direction of the gradient of the loss with respect to that parameter, scaled by a learning rate coefficient:

w -= lr * dw

where lr is the learning rate and dw is the gradient of the loss with respect to the weight w.
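A runnable toy example of that update (the quadratic loss, target, and lr value here are purely illustrative):

import numpy as np

# Toy example: minimize ||w - target||^2 with plain SGD.
target = np.array([3.0, -2.0])
w = np.zeros(2)
lr = 0.1                      # illustrative learning rate

for step in range(100):
    dw = 2 * (w - target)     # gradient of the loss w.r.t. w
    w -= lr * dw              # the update above

print(w)                      # converges toward [3.0, -2.0]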
SGD with Momentum
Update the weight according to a velocity, which is an exponential moving average of past gradients:

v = beta * v + (1 - beta) * dw
w -= lr * v

where beta is a value between 0 (normal SGD) and 1 (perfect momentum, the velocity never updates). beta = 0.995 is typical.
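A sketch of one momentum step in the same style (the function name and lr value are illustrative; beta = 0.995 is the value suggested above):

import numpy as np

def momentum_step(w, v, dw, lr=0.1, beta=0.995):
    # One SGD-with-momentum update using the EMA form above.
    v = beta * v + (1 - beta) * dw   # velocity: EMA of past gradients
    w = w - lr * v
    return w, v

# usage: carry v (initialized to zeros_like(w)) across training steps
w, v = np.ones(3), np.zeros(3)
w, v = momentum_step(w, v, dw=np.array([0.1, -0.2, 0.3]))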
Momentum Decay
These schedules decrease the momentum coefficient beta as a function of the training iteration t (a generic example is sketched below).
Decay functions include:
Nesterov’s
Sutskever’s
Demon
where T is the maximum timestep and t is the current training iteration.
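As a purely illustrative stand-in (not the Nesterov, Sutskever, or Demon formula), a decay schedule can be as simple as a linear ramp of beta down to zero over T iterations:

def decayed_beta(beta_init, t, T):
    # Illustrative linear decay: beta_init at t = 0, down to 0 at t = T.
    # A generic example, not one of the named schedules above.
    return beta_init * (1 - t / T)

# e.g. inside the training loop: beta_t = decayed_beta(0.995, t, T)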
Nesterov Momentum
aka Nesterov’s Accelerated Gradient or NAG
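A sketch of the NAG idea: the gradient is evaluated at the look-ahead point w + beta * v rather than at w (classic non-EMA formulation; names and values are illustrative):

import numpy as np

def nag_step(w, v, grad_fn, lr=0.1, beta=0.9):
    # One Nesterov-momentum update.
    dw = grad_fn(w + beta * v)   # gradient at the look-ahead position
    v = beta * v - lr * dw       # classic (non-EMA) velocity form
    return w + v, v

# usage with a toy gradient function
w, v = np.ones(3), np.zeros(3)
w, v = nag_step(w, v, grad_fn=lambda w: 2 * w)   # gradient of ||w||^2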
AdaGrad
Like SGD, but accumulates the squared gradient for each parameter, and scales that parameter's learning rate down as the total grows:

G += dw ** 2
w -= lr * dw / (sqrt(G) + eps)

where eps is a small value to prevent division by zero.
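A sketch of one AdaGrad step (the function name and the lr/eps values are illustrative):

import numpy as np

def adagrad_step(w, G, dw, lr=0.1, eps=1e-8):
    # One AdaGrad update. G is the running total of squared gradients.
    G = G + dw ** 2                        # accumulate squared gradient
    w = w - lr * dw / (np.sqrt(G) + eps)   # effective lr shrinks as G grows
    return w, G

# usage: G starts at zeros_like(w) and only ever grows
w, G = np.ones(3), np.zeros(3)
w, G = adagrad_step(w, G, dw=np.array([0.1, -0.2, 0.3]))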
RMSProp
Maintains an exponential moving average of the squared gradient for each parameter. Unlike AdaGrad, the squared-gradient average decays over time, so it can continue to learn after a big early update.
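A sketch of one RMSProp step (the decay and eps values are common defaults, not taken from these notes):

import numpy as np

def rmsprop_step(w, sq_avg, dw, lr=1e-3, decay=0.9, eps=1e-8):
    # One RMSProp update. sq_avg is the EMA of squared gradients.
    sq_avg = decay * sq_avg + (1 - decay) * dw ** 2   # EMA of dw^2
    w = w - lr * dw / (np.sqrt(sq_avg) + eps)         # per-parameter scaled step
    return w, sq_avg

# usage: carry sq_avg (initialized to zeros_like(w)) across steps
w, sq_avg = np.ones(3), np.zeros(3)
w, sq_avg = rmsprop_step(w, sq_avg, dw=np.array([0.1, -0.2, 0.3]))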
AdaDelta
Similar to RMSProp, with unwarranted additional complexity…
Adam
Maintains both an EMA of the squared gradient and an EMA of the unsquared gradient for each parameter.
With reasonable hyperparams: beta1 = 0.9 (unsquared-gradient EMA), beta2 = 0.999 (squared-gradient EMA), eps = 1e-8.
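A sketch of one Adam step with the usual bias correction for the zero-initialized EMAs (standard formulation; the defaults are the standard Adam values):

import numpy as np

def adam_step(w, m, v, dw, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update. m: EMA of gradients, v: EMA of squared gradients,
    # t: 1-based step count, used for bias correction.
    m = beta1 * m + (1 - beta1) * dw          # first moment (unsquared gradient)
    v = beta2 * v + (1 - beta2) * dw ** 2     # second moment (squared gradient)
    m_hat = m / (1 - beta1 ** t)              # correct the bias from
    v_hat = v / (1 - beta2 ** t)              #   zero initialization
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# usage: m and v start as zeros_like(w); t starts at 1
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
w, m, v = adam_step(w, m, v, dw=np.array([0.1, -0.2, 0.3]), t=1)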
AdamW
Adam with decoupled weight decay: the decay term is applied directly to the weights rather than added to the gradient as L2 regularization.
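A sketch of one AdamW step; the only change from Adam is the decoupled decay term applied straight to w (the weight_decay value is illustrative):

import numpy as np

def adamw_step(w, m, v, dw, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    # One AdamW update: a decoupled weight-decay term plus the Adam step.
    w = w - lr * weight_decay * w             # decay the weights directly
    m = beta1 * m + (1 - beta1) * dw
    v = beta2 * v + (1 - beta2) * dw ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v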
AdaFactor
Doesn’t store the first-moment (momentum) EMA, and stores only a factorized (row/column) version of the second-moment (squared-gradient) EMA.
Takes much less VRAM to train.
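A simplified illustration of the factorization idea for a 2-D parameter (this is not the full Adafactor algorithm; the names, decay value, and rank-1 reconstruction are a sketch of the row/column trick):

import numpy as np

def factored_second_moment(R, C, dw, decay=0.999, eps=1e-30):
    # Instead of a full (n, m) EMA of dw**2, keep only a row statistic
    # R (n,) and a column statistic C (m,), and reconstruct a per-element
    # scale from their outer product.
    sq = dw ** 2 + eps
    R = decay * R + (1 - decay) * sq.sum(axis=1)   # per-row sums of dw^2
    C = decay * C + (1 - decay) * sq.sum(axis=0)   # per-column sums of dw^2
    V = np.outer(R, C) / R.sum()                   # rank-1 approximation of the (n, m) second moment
    return R, C, V

# usage for an (n, m) weight matrix: w -= lr * dw / np.sqrt(V)
n, m = 4, 3
R, C = np.zeros(n), np.zeros(m)
dw = np.random.randn(n, m)
R, C, V = factored_second_moment(R, C, dw)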
Proposed Hyperparams: