or: One Model To Learn Them All

A convolutional, ByteNet-style multimodal model is trained simultaneously on audio, image, and text inputs to produce captions, transcriptions, categorizations, and language translations. The authors experiment with sparsely-gated mixture-of-experts feedforward layers and self-attention blocks, and find that each improves certain tasks while neither degrades performance on any task.
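
To make the sparsely-gated mixture-of-experts idea concrete, here is a minimal NumPy sketch of one such feedforward layer: a gating network scores all experts, only the top-k experts run, and their outputs are combined with the (renormalized) gate weights. All names, shapes, and the choice of k are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, k=2):
    """Toy sparsely-gated mixture-of-experts feedforward layer.

    x:              (d_model,) input vector
    expert_weights: list of (W1, W2) pairs, one ReLU feedforward expert each
    gate_weights:   (d_model, n_experts) gating matrix
    k:              number of experts activated per input (sparsity)
    """
    logits = x @ gate_weights                    # gating score for every expert
    top_k = np.argsort(logits)[-k:]              # indices of the k best experts
    # Softmax over only the selected experts; all others get zero weight.
    probs = np.exp(logits[top_k] - logits[top_k].max())
    probs /= probs.sum()
    # Run just the active experts and mix their outputs by the gate weights.
    out = np.zeros_like(x)
    for p, i in zip(probs, top_k):
        W1, W2 = expert_weights[i]
        hidden = np.maximum(x @ W1, 0.0)         # ReLU feedforward expert
        out += p * (hidden @ W2)
    return out

# Tiny usage example with random parameters (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 4
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]
gate = rng.normal(size=(d_model, n_experts))
y = moe_layer(rng.normal(size=d_model), experts, gate, k=2)
print(y.shape)  # (8,)
```

Because only k experts run per input, total parameter count can grow with the number of experts while per-example compute stays roughly constant, which is what makes this layer attractive in a large shared multimodal model.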