DeiT
or: Training data-efficient image transformers & distillation through attention
Vision transformers require more training data than convnets, since convnets build in priors about locality and translation invariance. Paper’s contributions:
- Use heavy augmentation and regularization so an image transformer can be trained on ImageNet-1k alone, without external data.
- Use a distillation strategy specific to transformers: a learned distillation token attends to the other tokens and is trained against a convnet teacher's predictions, improving accuracy without extra training data (see the sketch below).
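
A minimal sketch of the hard-label distillation objective described in the paper, assuming the student exposes separate logits for its class token and its distillation token (the function and argument names here are illustrative, not from the released code): the class token is supervised with the ground-truth labels, the distillation token with the teacher's argmax predictions, and the two cross-entropy terms are averaged.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits: torch.Tensor,
                           dist_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           labels: torch.Tensor) -> torch.Tensor:
    """DeiT-style hard-label distillation (sketch).

    cls_logits:     student predictions from the class token
    dist_logits:    student predictions from the distillation token
    teacher_logits: convnet teacher predictions (computed without gradients)
    labels:         ground-truth class labels
    """
    # Hard targets: the teacher's most likely class per example.
    teacher_labels = teacher_logits.argmax(dim=-1)
    # Supervised term on the class token.
    loss_cls = F.cross_entropy(cls_logits, labels)
    # Distillation term on the distillation token.
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    # Equal weighting of the two objectives.
    return 0.5 * loss_cls + 0.5 * loss_dist
```

At inference time the paper fuses the two heads (e.g. by averaging the class-token and distillation-token predictions), so the extra token adds negligible cost.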