or: An Image Is Worth 16x16 Words: Transformers For Image Recognition At Scale

ViT is a transformer-based architecture for image recognition and related backbone tasks.
With large-scale pretraining it achieves results competitive with state-of-the-art CNNs, relying solely on transformers rather than convolutions.
It begins by splitting the image into fixed-size 16x16 pixel patches. Each patch is flattened and mapped by a learned linear projection to the transformer's embedding dimension (e.g. 768); the resulting sequence of patch embeddings, together with position embeddings, is fed to a standard transformer-encoder stack.
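The patchify-and-project step above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the projection matrix is random here rather than learned, and the function names are my own. Note that for a 16x16 RGB patch the flattened length is 16*16*3 = 768, which happens to match a common embedding width.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into N flattened patches of length patch*patch*C."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image dims must be divisible by patch size"
    gh, gw = H // patch, W // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (N, patch*patch*C)
    x = img.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))          # a 224x224 RGB image
patches = patchify(img)                           # (196, 768): 14x14 grid of patches
D = 768                                           # transformer embedding dimension
W_embed = rng.standard_normal((patches.shape[1], D)) * 0.02  # stand-in for the learned projection
tokens = patches @ W_embed                        # (196, 768): the encoder's input sequence
print(patches.shape, tokens.shape)
```

In the real model a learnable class token is prepended and position embeddings are added before the sequence enters the encoder; both are omitted here for brevity.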