or: Decoupled Weight Decay Regularization

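Weight decay helps generalization. Adam performs poorly compared to SGD on some image tasks because its adaptive per-parameter scaling interferes with the common implementation of weight decay as an L2 penalty added to the gradient: the penalty gets rescaled along with the gradient, so it no longer acts as a uniform decay on the weights. AdamW decouples the decay from the gradient update and should perform at least as well as both SGD and Adam with naive weight decay.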
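
To make the difference concrete, here is a minimal sketch of a single update step for one scalar weight, in plain Python. The names `adam_step` and `decay_effect` are hypothetical helpers for this note, not any library's API, and scaling the decoupled decay by `lr` is an assumed convention. With `decoupled=False` the decay is folded into the gradient (the naive version); with `decoupled=True` it is applied to the weight directly, AdamW-style.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One Adam update for a scalar weight. Returns (theta, m, v)."""
    decay = 0.0
    if decoupled:
        # AdamW-style: decay acts on the weight directly and bypasses the
        # adaptive normalization. Scaling by lr is an assumed convention.
        decay = lr * weight_decay * theta
    else:
        # Naive: fold the L2 penalty into the gradient, so it is normalized
        # by sqrt(v_hat) together with the loss gradient.
        grad = grad + weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - decay
    return theta, m, v


def decay_effect(grad, decoupled, weight_decay=0.01):
    """Extra first-step shrinkage of theta = 1.0 caused by the decay term."""
    without, _, _ = adam_step(1.0, grad, 0.0, 0.0, t=1)
    with_decay, _, _ = adam_step(1.0, grad, 0.0, 0.0, t=1,
                                 weight_decay=weight_decay,
                                 decoupled=decoupled)
    return without - with_decay


if __name__ == "__main__":
    for g in (0.1, 10.0):
        print(g, decay_effect(g, decoupled=False),
              decay_effect(g, decoupled=True))
```

On the first step Adam's normalized update has magnitude close to `lr` whatever the gradient scale, so the naive L2 term changes almost nothing. The decoupled term shrinks the weight by `lr * weight_decay * theta` for both gradients.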

see Optimizer Comparison