or: Training data-efficient image transformers & distillation through attention

Transformers require more training data than ConvNets, since ConvNets encode priors about locality (e.g., translation-equivariant convolutions) that a transformer must learn from data. The paper's contributions:

  • Use heavy data augmentation and regularization to supply enough effective data to train an image transformer on ImageNet alone, with no external data.
  • Use distillation through attention: a distillation token lets the transformer student learn from a ConvNet teacher, so a reasonably sized transformer reaches strong accuracy for inference (see the sketch after this list).
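
A minimal sketch of the hard-distillation objective described in the paper, written in PyTorch with hypothetical names (this is an illustration under assumed shapes, not the authors' code): the student transformer produces one prediction from its class token and one from its distillation token, and the distillation head is trained against the ConvNet teacher's hard labels.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    """Hypothetical helper.
    cls_logits / dist_logits: student outputs from the class and distillation tokens.
    teacher_logits: outputs of a frozen ConvNet teacher on the same batch.
    targets: ground-truth class labels.
    """
    # Class-token head learns from the true labels.
    loss_cls = F.cross_entropy(cls_logits, targets)
    # Distillation-token head learns from the teacher's hard decisions.
    teacher_labels = teacher_logits.argmax(dim=-1)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    # Equal weighting of the two terms, as in the hard-distillation variant.
    return 0.5 * loss_cls + 0.5 * loss_dist
```

At test time the paper averages the softmax outputs of the two heads; the sketch above only covers the training loss.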