or: Training data-efficient image transformers & distillation through attention

Transformers require more training data than ConvNets, since ConvNets encode priors about locality (e.g., translation-equivariant convolutions) that a transformer must learn from data. The paper's contributions:

  • Use heavy data augmentation and regularization to supply enough effective data to train an image transformer on ImageNet alone, with no external data.
  • Use distillation through attention: a distillation token lets the transformer student learn from a ConvNet teacher, so a reasonably sized transformer reaches strong accuracy for inference (see the sketch after this list).
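
A minimal sketch of the hard-distillation objective described in the paper, written in PyTorch with hypothetical names (this is an illustration under assumed shapes, not the authors' code): the student transformer produces one prediction from its class token and one from its distillation token, and the distillation head is trained against the ConvNet teacher's hard labels.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    """Hypothetical helper.
    cls_logits / dist_logits: student outputs from the class and distillation tokens.
    teacher_logits: outputs of a frozen ConvNet teacher on the same batch.
    targets: ground-truth class labels.
    """
    # Class-token head learns from the true labels.
    loss_cls = F.cross_entropy(cls_logits, targets)
    # Distillation-token head learns from the teacher's hard decisions.
    teacher_labels = teacher_logits.argmax(dim=-1)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    # Equal weighting of the two terms, as in the hard-distillation variant.
    return 0.5 * loss_cls + 0.5 * loss_dist
```

At test time the paper averages the softmax outputs of the two heads; the sketch above only covers the training loss.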