or: Transcending Scaling Laws with 0.1% Extra Compute

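Briefly continuing to train a large causal language model on denoising tasks that use bidirectional context improves generalization, at a cost of roughly 0.1% of the original training compute.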

In this case, the authors continued training PaLM to create U-PaLM.
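
To make "bidirectional tasks" concrete, here is a minimal sketch of how a plain token sequence might be turned into two kinds of training examples: a span-corruption (infilling) example, where the model sees context on both sides of each gap, and a prefix-LM example. This is an illustration under assumptions, not the authors' pipeline: the function names `span_corrupt` and `prefix_lm`, the `<extra_id_i>` sentinel strings, and the hand-picked spans are all hypothetical choices made for the example.

```python
def span_corrupt(tokens, spans):
    """Build a span-corruption example.

    tokens: list of string tokens.
    spans:  sorted, non-overlapping (start, length) pairs to mask out.
    Returns (inputs, targets): inputs have each span replaced by a sentinel,
    targets list each sentinel followed by the tokens it hides.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"  # hypothetical sentinel naming
        inputs.extend(tokens[cursor:start])   # keep text before the span
        inputs.append(sentinel)               # mark the gap in the input
        targets.append(sentinel)              # same marker opens the answer
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])            # keep text after the last span
    return inputs, targets


def prefix_lm(tokens, split):
    """Build a prefix-LM example: condition on a prefix, predict the rest."""
    return tokens[:split], tokens[split:]


tokens = "the quick brown fox jumps over the lazy dog".split()

corrupted_inputs, corrupted_targets = span_corrupt(tokens, [(1, 2), (6, 1)])
print(" ".join(corrupted_inputs))
print(" ".join(corrupted_targets))

prefix, continuation = prefix_lm(tokens, 4)
print(" ".join(prefix), "->", " ".join(continuation))
```

In the infilling example the model has to predict the hidden tokens using text on both sides of each gap, which is what an ordinary left-to-right objective never asks for. In practice the spans would be sampled randomly, with varying lengths and corruption rates, rather than fixed by hand.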