or: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

A giant multimodal transformer backbone ("Multiway Transformers": self-attention shared across modalities, with the feed-forward layer switched between modality-specific experts), pretrained with a single masked-data-modeling objective on images, text, and image-text pairs, reaches SOTA on individual vision and vision-language tasks (detection, segmentation, VQA, retrieval, etc.).
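
As a concrete picture of the "switched" part, here is a minimal PyTorch sketch of one such block. All names, dimensions, and the two-expert routing are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """One Multiway-style transformer block (sketch): shared self-attention,
    per-modality feed-forward experts (assumed: 0 = vision, 1 = language)."""

    def __init__(self, dim=768, heads=12, n_modalities=2, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # shared
        self.norm2 = nn.LayerNorm(dim)
        # One FFN "expert" per modality; tokens are hard-routed by modality id.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(dim, dim * mlp_ratio),
                nn.GELU(),
                nn.Linear(dim * mlp_ratio, dim),
            )
            for _ in range(n_modalities)
        )

    def forward(self, x, modality_ids):
        # x: (batch, seq, dim); modality_ids: (batch, seq) long tensor.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m          # tokens belonging to modality m
            out[mask] = expert(h[mask])       # route them through m's expert
        return x + out

if __name__ == "__main__":
    blk = MultiwayBlock()
    x = torch.randn(2, 16, 768)                   # 2 sequences of 16 tokens
    ids = torch.tensor([[0] * 10 + [1] * 6] * 2)  # first 10 image, last 6 text
    print(blk(x, ids).shape)                      # torch.Size([2, 16, 768])
```

Routing is a hard switch on modality, not a learned gate, so each token always sees the attention weights shared by all modalities but only its own modality's FFN parameters.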

(TODO: look into differences between BEiT and MultiModel, ImageBind, etc.)