BEiT-3
or: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
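A giant Multiway Transformer backbone (shared self-attention, with feed-forward experts switched by modality) is SOTA on individual vision and vision-language tasks after multimodal pretraining with masked data modeling.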
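To make the "switched" backbone concrete, here is a minimal sketch of one such block in plain Python. It assumes the switching works as follows: every token passes through the same self-attention weights, and each token is then routed to a feed-forward network chosen by its modality tag. The names `MultiwayBlock`, `vision`, and `language` are hypothetical, the weights are random, and layer norm, multiple heads, and any fused vision-language expert are left out for brevity, so this is an illustration of the routing idea and not the paper's implementation.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gelu(x):
    """Exact GELU activation via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class MultiwayBlock:
    """Hypothetical block: shared attention, one FFN expert per modality."""

    def __init__(self, dim, hidden, modalities=("vision", "language"), seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Attention projections are shared by every token, whatever its modality.
        self.Wq = rand_matrix(dim, dim, rng)
        self.Wk = rand_matrix(dim, dim, rng)
        self.Wv = rand_matrix(dim, dim, rng)
        # Each modality owns its own pair of feed-forward matrices.
        self.experts = {
            m: (rand_matrix(hidden, dim, rng), rand_matrix(dim, hidden, rng))
            for m in modalities
        }

    def attention(self, tokens):
        """Single-head scaled dot-product self-attention over all tokens."""
        q = [matvec(self.Wq, t) for t in tokens]
        k = [matvec(self.Wk, t) for t in tokens]
        v = [matvec(self.Wv, t) for t in tokens]
        scale = 1.0 / math.sqrt(self.dim)
        mixed = []
        for qi in q:
            weights = softmax(
                [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
            )
            mixed.append(
                [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(self.dim)]
            )
        return mixed

    def ffn(self, modality, x):
        """Feed-forward expert selected by the token's modality tag."""
        w_in, w_out = self.experts[modality]
        return matvec(w_out, [gelu(h) for h in matvec(w_in, x)])

    def forward(self, tokens, modalities):
        """Shared attention with residual, then per-token expert with residual."""
        attn = self.attention(tokens)
        hidden = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attn)]
        return [
            [a + b for a, b in zip(h, self.ffn(m, h))]
            for h, m in zip(hidden, modalities)
        ]


# Two image-patch tokens followed by one text token, all in one sequence.
block = MultiwayBlock(dim=4, hidden=8)
tokens = [[0.5, -1.0, 0.25, 2.0], [1.0, 0.0, -0.5, 0.3], [-0.2, 0.7, 1.5, -1.1]]
out = block.forward(tokens, ["vision", "vision", "language"])
print(len(out), len(out[0]))
```

In this sketch the modality tag only affects the feed-forward step: relabelling the third token as `vision` changes that token's output but leaves the other two untouched, because attention never looks at the tag.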
(TODO: look into differences between BEiT and MultiModel, ImageBind, etc.)