or: Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

An alternative to BatchNorm. Instead of normalizing the distribution of pre-activations in each layer over each minibatch, reparameterize each weight vector as a direction times a learned length: w = g * v / ||v||. Because the normalization is part of the forward pass, gradients are computed with respect to g and v during backprop, which the authors found works better than simply renormalizing the weights after each gradient descent step.
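Here's a minimal sketch of the reparameterization, assuming PyTorch; the class name `WeightNormLinear` and the initialization scheme are illustrative, not from the paper's code. PyTorch also ships a built-in version of this reparameterization (`torch.nn.utils.weight_norm`).

```python
# Minimal sketch of weight normalization: w = g * v / ||v|| per output unit.
import torch
import torch.nn as nn


class WeightNormLinear(nn.Module):
    """Linear layer whose weight is reparameterized as a learned length g
    times a unit-norm direction v / ||v||."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.v = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.g = nn.Parameter(torch.ones(out_features))   # learned length
        self.b = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Normalize each row of v to unit length, then scale by g.
        w = self.g.unsqueeze(1) * self.v / self.v.norm(dim=1, keepdim=True)
        return x @ w.t() + self.b


# Autograd differentiates through the reparameterization, so gradients
# flow to g and v rather than to the effective weight w directly.
layer = WeightNormLinear(64, 32)
loss = layer(torch.randn(8, 64)).sum()
loss.backward()
```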

Most networks, convolutional ones especially, have far more activations than weights, so normalizing the weights is much cheaper than normalizing activations the way BatchNorm does.

Combining weight normalization with per-minibatch mean subtraction of the pre-activations ("mean-only batch normalization") gives a cheaper replacement for BatchNorm, and the authors report it is more effective in their experiments. However, it introduces an extra length parameter per weight vector, is less stable when training large networks, and subsequent papers have been more critical.
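A minimal sketch of the mean-only variant, again assuming PyTorch; `MeanOnlyBatchNorm` and the momentum value are illustrative. It subtracts the minibatch mean of the pre-activations and adds a learned bias, with no variance scaling, and would sit right after a weight-normalized layer.

```python
# Mean-only batch normalization: center pre-activations, add a learned bias.
import torch
import torch.nn as nn


class MeanOnlyBatchNorm(nn.Module):
    def __init__(self, num_features, momentum=0.1):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.momentum = momentum
        self.register_buffer("running_mean", torch.zeros(num_features))

    def forward(self, x):
        if self.training:
            mu = x.mean(dim=0)
            # Track a running mean for use at test time.
            self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mu.detach())
        else:
            mu = self.running_mean
        return x - mu + self.bias


# Example pairing with the weight-normalized layer sketched above.
block = nn.Sequential(WeightNormLinear(64, 32), MeanOnlyBatchNorm(32), nn.ReLU())
y = block(torch.randn(8, 64))
```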