Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

SGD-M requires storing a momentum estimate for each parameter. Adam requires two running estimates per parameter, a first moment (momentum) and a second moment (squared gradients), so training consumes roughly 3x the memory of the parameters themselves.

What if only the row sums and column sums of the second-moment accumulator were retained? The second-moment estimate for each parameter can be reconstructed from this rank-1 factorization. And is the first moment really necessary? Not with a few additional tweaks, sketched in code after the list below:

  • update clipping
    • limit each update vector or matrix to have a maximum RMS of some constant $d$
    • proposed: $d = 1$
  • relative step size
    • instead of a constant learning rate, use a proportion multiplied by the RMS of the parameter values from the previous step
  • decaying step size
    • the proportion itself decays as $\min(0.01, t^{-0.5})$
  • increasing second-moment decay
    • avoids severe overshoot early in training
    • proposed: $1 - t^{-0.8}$
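
Below is a minimal NumPy sketch of one such update step for a single weight matrix, assuming the quantities above (the clipping threshold $d$, the decaying proportion, and the increasing decay) plus two small stabilizing constants eps1 and eps2 that I picked for illustration; the function name and constants are my own, and the goal is to show how the factored second moment and the tweaks fit together, not to reproduce the reference implementation.

```python
import numpy as np

def rms(x):
    """Root-mean-square of a tensor's entries."""
    return np.sqrt(np.mean(np.square(x)))

def adafactor_matrix_step(X, R, C, grad, t, d=1.0, eps1=1e-30, eps2=1e-3):
    """One Adafactor-style update for an n x m matrix parameter X.

    R (length n) and C (length m) are the running row-sum and column-sum
    accumulators of the squared gradients; eps1/eps2 are small stabilizers
    whose values here are assumptions, not prescriptions.
    """
    # Relative step size: a proportion times the scale (RMS) of the parameters.
    rho_t = min(1e-2, 1.0 / np.sqrt(t))      # decaying proportion
    alpha_t = max(eps2, rms(X)) * rho_t

    # Increasing second-moment decay: starts at 0, approaches 1.
    beta2_t = 1.0 - t ** (-0.8)

    # Keep only row sums and column sums of the squared-gradient average.
    sq = np.square(grad) + eps1
    R = beta2_t * R + (1.0 - beta2_t) * sq.sum(axis=1)
    C = beta2_t * C + (1.0 - beta2_t) * sq.sum(axis=0)

    # Reconstruct the full second-moment estimate from the factorization.
    V_hat = np.outer(R, C) / R.sum()

    # Adam-like scaling of the gradient, with no first-moment term.
    U = grad / np.sqrt(V_hat)

    # Update clipping: cap the update's RMS at d.
    U = U / max(1.0, rms(U) / d)

    return X - alpha_t * U, R, C
```

Starting from R and C initialized to zeros, the step can be called in a loop with t = 1, 2, ...; at t = 1 the decay is 0, so the accumulators are seeded directly from the first squared gradient.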

Factorization only works for weight matrices, not for vectors or biases. The other tweaks are applied to vector and bias updates as well, as sketched below.
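
Continuing the sketch above (reusing numpy and the rms helper), a vector or bias parameter keeps its second-moment average per element rather than factored, while the same clipping, relative step size, and increasing decay apply:

```python
def adafactor_vector_step(x, v, grad, t, d=1.0, eps1=1e-30, eps2=1e-3):
    """Update for a vector parameter x: the second-moment average v is
    stored per element (no factorization); the other tweaks are identical."""
    rho_t = min(1e-2, 1.0 / np.sqrt(t))
    alpha_t = max(eps2, rms(x)) * rho_t
    beta2_t = 1.0 - t ** (-0.8)
    v = beta2_t * v + (1.0 - beta2_t) * (np.square(grad) + eps1)
    u = grad / np.sqrt(v)
    u = u / max(1.0, rms(u) / d)
    return x - alpha_t * u, v
```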

An n × m weight matrix then requires (nm + n + m) memory during training, instead of (3nm) with Adam. Results are similar to Adam for transformer models. Not sure how it works for convnets.
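
As a quick back-of-envelope check (layer dimensions made up for illustration), the saving for a single matrix is close to the full 3x:

```python
n, m = 1024, 4096                 # hypothetical layer dimensions
adam = 3 * n * m                  # parameters + first moment + second moment
factored = n * m + n + m          # parameters + row sums + column sums
print(adam / factored)            # ~3.0
```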