or: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

A clean, large encoder-decoder Transformer is used as a common testbed to determine the most effective methods for pretraining and fine-tuning, with every variant pretrained on the same large dataset (C4, the Colossal Clean Crawled Corpus).

The model uses a short task-specific text prefix prepended to the input to switch between the various text-based tasks, so every task is cast as text-in, text-out.
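
A minimal sketch of what those prefixed inputs and targets look like; the prefixes and example pairs are approximately the ones from the paper's Figure 1, and the list layout itself is just illustrative:

```python
# Every task is cast as text-to-text: a plain-text prefix names the task,
# and the same model maps the prefixed input string to an output string.
examples = [
    # machine translation
    ("translate English to German: That is good.", "Das ist gut."),
    # linguistic acceptability (CoLA) -- the label is emitted as text
    ("cola sentence: The course is jumping well.", "not acceptable"),
    # semantic similarity (STS-B) -- even the regression target is a string
    ("stsb sentence1: The rhino grazed on the grass. sentence2: A rhino "
     "is grazing in a field.", "3.8"),
    # abstractive summarization
    ("summarize: state authorities dispatched emergency crews tuesday to "
     "survey the damage after an onslaught of severe weather ...",
     "six people hospitalized after a storm in attala county."),
]

for source, target in examples:
    print(f"{source!r} -> {target!r}")
```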

Pretraining is done BERT-style as a denoising objective, but it corrupts short contiguous spans (each span replaced by a single sentinel token, with the target being the dropped-out spans) rather than corrupting individual tokens.
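
A rough sketch of that span-corruption preprocessing, assuming whitespace tokenization and readable sentinel names; the real pipeline operates on SentencePiece ids with dedicated sentinel tokens, and the exact span sampling differs:

```python
import random

def span_corrupt(tokens, corrupt_rate=0.15, span_len=3, seed=0):
    """Mask contiguous spans and build (input, target) for a denoising objective.

    Returns the corrupted input (spans replaced by sentinels) and the target
    (each sentinel followed by the span it replaced) -- BERT-style, but on spans.
    """
    rng = random.Random(seed)
    n_corrupt = max(1, round(len(tokens) * corrupt_rate))
    masked = set()
    while len(masked) < n_corrupt:
        start = rng.randrange(len(tokens))
        for i in range(start, min(len(tokens), start + span_len)):
            masked.add(i)

    sentinels = iter(f"<extra_id_{i}>" for i in range(100))
    inp, tgt, in_span = [], [], False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not in_span:              # first token of a new masked span
                s = next(sentinels)
                inp.append(s)
                tgt.append(s)
                in_span = True
            tgt.append(tok)              # dropped-out token goes to the target
        else:
            inp.append(tok)
            in_span = False
    return " ".join(inp), " ".join(tgt)

text = "Thank you for inviting me to your party last week ."
print(span_corrupt(text.split()))
```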

They use a simplified layer norm that only rescales the activations: no recentering (mean subtraction) and no additive bias.
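
A minimal NumPy sketch of that scale-only layer norm; names and the epsilon value are illustrative, not taken from the codebase:

```python
import numpy as np

def rms_layer_norm(x, scale, eps=1e-6):
    """Rescale activations by their root mean square; no recentering, no bias."""
    # variance-like statistic computed without subtracting the mean
    ms = np.mean(np.square(x), axis=-1, keepdims=True)
    return x / np.sqrt(ms + eps) * scale

d_model = 8
x = np.random.randn(2, d_model)        # (batch, d_model) activations
scale = np.ones(d_model)               # learned gain, initialized to 1
print(rms_layer_norm(x, scale).shape)  # (2, 8)
```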

Sequence length is 512 and batch size is 2048 sequences (about 2^20 ≈ 1M SentencePiece tokens per batch), with the model trained for 1M batches, i.e. roughly a trillion pretraining tokens in total.
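
Back-of-the-envelope arithmetic on what those numbers imply for total pretraining scale (just arithmetic, no new facts):

```python
seq_len = 512
batch   = 2048                       # sequences per batch
steps   = 1_000_000                  # pretraining batches

tokens_per_batch = seq_len * batch   # 1,048,576 = 2**20 tokens
total_tokens = tokens_per_batch * steps
print(f"{tokens_per_batch:,} tokens/batch, {total_tokens:,} tokens total")
# -> 1,048,576 tokens/batch, ~1.05e12 (about a trillion) tokens total
```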

Large variants are created in part by adding more attention heads, but mostly by scaling up the feed-forward layers, because large dense matrix multiplications are what accelerators like TPUs handle most efficiently.
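
A quick parameter-count sketch of why d_ff is the convenient knob: feed-forward parameters grow linearly in d_ff and sit in two big dense matmuls per layer, exactly the shape accelerators like. The dimensions below are made up for illustration, not any particular T5 config:

```python
def ffn_params(d_model, d_ff):
    # two dense projections per layer: d_model -> d_ff and d_ff -> d_model
    return 2 * d_model * d_ff

def attn_params(d_model, n_heads, d_head):
    # Q, K, V and output projections, each of size d_model x (n_heads * d_head)
    return 4 * d_model * n_heads * d_head

d_model, n_heads, d_head = 1024, 32, 128   # illustrative dimensions
for d_ff in (4096, 16384, 65536):
    print(f"d_ff={d_ff}: ffn={ffn_params(d_model, d_ff):,} "
          f"attn={attn_params(d_model, n_heads, d_head):,}")
# The FFN term scales linearly in d_ff while staying a single pair of dense
# matmuls per layer, so most of the added parameters land in TPU-friendly ops.
```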

Includes an insightful summary of self-attention:

Self-attention is a variant of attention that processes a sequence by replacing each element by a weighted average of the rest of the sequence.
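
That sentence translates almost directly into code: each position's output is a weighted average over the (projected) sequence, with weights coming from a softmax of query-key dot products. A minimal single-head NumPy sketch with made-up dimensions:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Replace each element of x with a weighted average over the sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv             # project to queries/keys/values
    scores = q @ k.T / np.sqrt(k.shape[-1])      # similarity of each position to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ v                           # weighted average of value vectors

seq_len, d_model = 5, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)       # (5, 16)
```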