or: Deep Networks with Stochastic Depth

When training a deep residual network such as a ResNet or a Transformer, randomly bypass some residual blocks in each mini-batch, letting the identity shortcut carry the signal through the skipped block. This acts as a regularizer, shortens the expected depth seen during training (reducing training time), and allows substantially deeper networks to be trained. At test time every block is kept, with its output scaled by the probability that it survived during training.
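The idea can be sketched as follows (a minimal NumPy illustration, not the paper's implementation; the function names are mine, and the linearly decaying survival schedule with `p_L = 0.5` follows the schedule commonly used for stochastic depth):

```python
import numpy as np

def residual_block_forward(x, f, p_survive, training, rng=None):
    """One residual block y = x + f(x) with stochastic depth.

    During training the block is bypassed (identity only) with
    probability 1 - p_survive; at test time f's output is scaled
    by p_survive so expected activations match training.
    """
    if training:
        rng = rng or np.random.default_rng()
        if rng.random() < p_survive:
            return x + f(x)
        return x  # block skipped: identity shortcut only
    return x + p_survive * f(x)

def survival_prob(l, L, p_L=0.5):
    """Linearly decaying schedule: block l of L survives with
    probability 1 at the input, decaying to p_L at the last block."""
    return 1.0 - (l / L) * (1.0 - p_L)
```

Early blocks are almost always kept while deep blocks are dropped often, so the network trained per batch is, in expectation, much shallower than the full model.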