or: Learning Sparse Neural Networks Through L0 Regularization

An alternative scheme for encouraging model sparsity, in place of weight decay. Weight decay on large models biases every parameter toward zero, whereas this setup smoothly encourages exact zeros while leaving most nonzero parameters alone.
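
A minimal sketch of the mechanism, assuming PyTorch: each parameter group gets a stochastic "hard concrete" gate whose expected L0 norm is differentiable, so the gates can be pushed to exactly zero while the surviving weights are untouched. The class name `L0Gate`, the gating of a linear layer's inputs, and the hyperparameter values below are illustrative assumptions, not the paper's reference implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class L0Gate(nn.Module):
    """Stochastic gate z in [0, 1] per unit, trainable toward exact 0s and 1s."""

    def __init__(self, num_gates, beta=2 / 3, gamma=-0.1, zeta=1.1):
        super().__init__()
        # log_alpha controls how likely each gate is to stay open.
        self.log_alpha = nn.Parameter(torch.zeros(num_gates))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            # Reparameterized sample from the hard-concrete distribution.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            # Deterministic gate at evaluation time.
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (gamma, zeta), then clamp so exact 0 and 1 have nonzero mass.
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_l0(self):
        # Probability each gate is nonzero; the sum is the differentiable L0 penalty.
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()


# Usage sketch: gate a linear layer's inputs and penalize the expected L0 norm.
layer = nn.Linear(128, 64)
gate = L0Gate(num_gates=128)
x = torch.randn(32, 128)
out = F.linear(x * gate(), layer.weight, layer.bias)
loss = out.pow(2).mean() + 1e-3 * gate.expected_l0()
loss.backward()
```

The key contrast with weight decay is visible in the penalty: `expected_l0` charges only for gates that are probably open, so a weight that survives pruning pays no shrinkage, whereas an L2 penalty would keep pulling every weight toward zero.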