SGD

Plain Stochastic Gradient Descent:
Move the weight in the direction opposite the gradient with respect to that parameter, scaled by a learning rate coefficient.

w_{t+1} = w_t - \eta \frac{\partial}{\partial w}

where η is the learning rate.

w -= lr * dw

SGD with Momentum

Update the weight according to the velocity, which is an exponential moving average of past gradients.

w_{t+1} = w_t + \eta v_t \\ v_t = \beta v_{t-1} + (1 - \beta)\left(-\frac{\partial}{\partial w}\right)

where β is a value between 0 (normal SGD) and 1 (perfect momentum, velocity never updates).
β = 0.995 is typical
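
A minimal sketch of one update step, assuming w, dw (the gradient), and velocity are NumPy arrays (or floats) of the same shape; the function name and defaults are mine, not standard:

def momentum_step(w, dw, velocity, lr=0.01, beta=0.995):
    # velocity: EMA of the negative gradient
    velocity = beta * velocity + (1 - beta) * (-dw)
    # move the weight along the velocity, scaled by the learning rate
    w = w + lr * velocity
    return w, velocity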

Momentum Decay

These schedules decrease the momentum coefficient β as a function of the training iteration t.
Decay functions include:

Nesterov’s

\beta_t = 1 - \dfrac{3}{t + 5}

Sutskever’s

\beta_t = \min\left(1 - 2^{-1 - \log_2(\lfloor t/250 \rfloor + 1)}, \beta_0\right)

Demon

\beta_t = \frac{\beta_0\left(1 - \frac{t}{T}\right)}{(1 - \beta_0) + \beta_0\left(1 - \frac{t}{T}\right)}

where T is the maximum timestep.
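
The three schedules as Python functions, a sketch assuming t counts training iterations from 0 and T is the total number of steps; the function names and the β₀ defaults are mine:

import numpy as np

def nesterov_beta(t):
    return 1 - 3 / (t + 5)

def sutskever_beta(t, beta_0=0.995):
    return min(1 - 2 ** (-1 - np.log2(t // 250 + 1)), beta_0)

def demon_beta(t, T, beta_0=0.9):
    # decays from beta_0 at t = 0 toward 0 at t = T
    frac = 1 - t / T
    return beta_0 * frac / ((1 - beta_0) + beta_0 * frac)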

Nesterov Momentum

aka Nesterov’s Accelerated Gradient or NAG

\phi_{t+1} = w_t - \eta \frac{\partial}{\partial w} \\ w_{t+1} = \phi_{t+1} + \beta(\phi_{t+1} - \phi_t)
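
A sketch of the two-step recurrence above, assuming dw is the gradient evaluated at the current iterate w and phi_prev is the previous φ; names and defaults are mine:

def nag_step(w, dw, phi_prev, lr=0.01, beta=0.9):
    # gradient step from the current (lookahead) point
    phi = w - lr * dw
    # extrapolate along the direction phi just moved
    w_next = phi + beta * (phi - phi_prev)
    return w_next, phi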

AdaGrad

Like SGD, but totals up the squared gradient for each parameter, and lowers the learning rate as the total accumulates.

w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \frac{\partial}{\partial w} \\ \quad \\ G_{t+1} = G_t + \left(\frac{\partial}{\partial w}\right)^2

where ϵ is a small value to prevent division by zero.
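
A minimal sketch, assuming G is the running total of squared gradients (initialized to zeros of the same shape as w); names and defaults are assumptions:

import numpy as np

def adagrad_step(w, dw, G, lr=0.01, eps=1e-8):
    # step size shrinks as the accumulated squared gradient grows
    w = w - lr * dw / (np.sqrt(G) + eps)
    # add the current squared gradient to the running total
    G = G + dw ** 2
    return w, G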

RMSProp

Maintains an exponential moving average of the squared gradient for each parameter. Unlike AdaGrad, it can continue to learn after a big early update.

w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \frac{\partial}{\partial w} \\ \quad \\ G_t = \beta G_{t-1} + (1 - \beta)\left(\frac{\partial}{\partial w}\right)^2
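
Same shape as the AdaGrad sketch, but G is now an EMA rather than a running total (names and defaults assumed):

import numpy as np

def rmsprop_step(w, dw, G, lr=0.001, beta=0.9, eps=1e-8):
    # EMA of the squared gradient
    G = beta * G + (1 - beta) * dw ** 2
    w = w - lr * dw / (np.sqrt(G) + eps)
    return w, G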

AdaDelta

Similar to RMSProp, with unwarranted additional complexity…

\begin{aligned} w_{t+1} & = w_t + v_t \\ V_t & = \beta V_{t-1} + (1 - \beta) v_{t-1}^2 \\ G_t & = \beta G_{t-1} + (1 - \beta)\left(\frac{\partial}{\partial w}\right)^2 \\ v_t & = -\frac{\sqrt{V_t}}{\sqrt{G_t} + \epsilon}\frac{\partial}{\partial w} \end{aligned}
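
A sketch following the equations above, carrying EMAs of the squared gradient (G) and the squared update (V), plus the previous update v_prev; names and defaults are mine:

import numpy as np

def adadelta_step(w, dw, G, V, v_prev, beta=0.95, eps=1e-6):
    G = beta * G + (1 - beta) * dw ** 2
    V = beta * V + (1 - beta) * v_prev ** 2
    # step is scaled by the ratio of RMS(update) to RMS(gradient)
    v = -np.sqrt(V) / (np.sqrt(G) + eps) * dw
    w = w + v
    return w, G, V, v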

Adam

Maintains both an EMA of the squared gradient and an EMA of the unsquared gradient for each parameter.

\begin{aligned} w_{t+1} & = w_t + \frac{\hat{v}_t \, \eta}{\sqrt{\hat{G}_t} + \epsilon} \\ v_t & = \beta_1 v_{t-1} + (1 - \beta_1)\left(-\frac{\partial}{\partial w}\right) \\ \hat{v}_t & = v_t / (1 - \beta_1^{\,t+1}) \\ G_t & = \beta_2 G_{t-1} + (1 - \beta_2)\left(\frac{\partial}{\partial w}\right)^2 \\ \hat{G}_t & = G_t / (1 - \beta_2^{\,t+1}) \end{aligned}

With reasonable hyperparameters: \eta = 0.001 \\ \beta_1 = 0.9 \\ \beta_2 = 0.999 \\ \epsilon = 10^{-8}
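
A sketch of one Adam step, assuming t starts at 0 and v, G start as zeros; the bias correction compensates for that zero initialization. The function name and signature are mine:

import numpy as np

def adam_step(w, dw, v, G, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # EMA of the negative gradient and of the squared gradient
    v = beta1 * v + (1 - beta1) * (-dw)
    G = beta2 * G + (1 - beta2) * dw ** 2
    # bias-corrected estimates
    v_hat = v / (1 - beta1 ** (t + 1))
    G_hat = G / (1 - beta2 ** (t + 1))
    w = w + lr * v_hat / (np.sqrt(G_hat) + eps)
    return w, v, G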

AdamW

Adam with decoupled weight decay: the decay shrinks the weights directly each step instead of being folded into the gradient as L2 regularization.
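
A sketch of the decoupled decay, reusing the adam_step sketch above; the weight_decay parameter and its default are assumptions:

def adamw_step(w, dw, v, G, t, lr=0.001, weight_decay=0.01, **adam_kwargs):
    # decay the weights directly, separate from the adaptive gradient step,
    # rather than adding an L2 term to the gradient
    w = w - lr * weight_decay * w
    return adam_step(w, dw, v, G, t, lr=lr, **adam_kwargs)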

AdaFactor

Doesn’t keep a first-moment (momentum) estimate, and stores only a factorized version of the second-moment estimate, so it takes much less VRAM to train.

\begin{aligned} \alpha_t & = \max(\epsilon_2, \text{RMS}(X_{t-1}))\, p_t \\ G_t & = \nabla f_t(X_{t-1}) \\ R_t & = \hat\beta_t R_{t-1} + (1 - \hat\beta_t)(G_t^2 + \epsilon_1 1_n 1_m^\top)1_m \\ C_t & = \hat\beta_t C_{t-1} + (1 - \hat\beta_t)1_n^\top(G_t^2 + \epsilon_1 1_n 1_m^\top) \\ V_t & = \frac{R_t C_t}{1_n^\top R_t} \\ U_t & = \frac{G_t}{\sqrt{V_t}} \\ \hat U_t & = \frac{U_t}{\max(1, \text{RMS}(U_t)/d)} \\ X_t & = X_{t-1} - \alpha_t \hat U_t \end{aligned}

Proposed hyperparameters: \begin{aligned} \epsilon_1 & = 10^{-30} \\ \epsilon_2 & = 10^{-3} \\ d & = 1 \\ p_t & = \min\left(10^{-2}, \frac{1}{\sqrt{t}}\right) \\ \hat\beta_t & = 1 - t^{-0.8} \end{aligned}
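
A rough sketch of the factored update for a single 2-D weight matrix X with gradient G, following the equations above; it assumes t starts at 1, and the function names (adafactor_step, rms) and keeping R, C as 1-D arrays are my choices:

import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def adafactor_step(X, G, R, C, t, d=1.0, eps1=1e-30, eps2=1e-3):
    p_t = min(1e-2, 1 / np.sqrt(t))                  # relative step-size schedule
    beta_t = 1 - t ** -0.8                           # decay for the second-moment stats
    alpha_t = max(eps2, rms(X)) * p_t
    sq = G ** 2 + eps1
    R = beta_t * R + (1 - beta_t) * sq.sum(axis=1)   # per-row statistics
    C = beta_t * C + (1 - beta_t) * sq.sum(axis=0)   # per-column statistics
    V = np.outer(R, C) / R.sum()                     # rank-1 second-moment estimate
    U = G / np.sqrt(V)
    U_hat = U / max(1.0, rms(U) / d)                 # update clipping
    X = X - alpha_t * U_hat
    return X, R, C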