or: Training Compute-Optimal Large Language Models

For a given compute budget, current large language models are undertrained: training on more data improves performance more than increasing parameter count.
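
A rough illustration of that trade-off, as a minimal sketch assuming the common C ≈ 6·N·D FLOP approximation (N params, D training tokens) and the ~20-tokens-per-parameter rule of thumb often quoted as a summary of the paper's fitted scaling laws; both are approximations introduced here, not the paper's exact fitted constants:

```python
# Sketch (not from the paper): split a FLOP budget between parameters
# and training tokens, assuming:
#   (1) training FLOPs C ~= 6 * N * D   (N params, D tokens)
#   (2) compute-optimal training uses ~20 tokens per parameter
#       (a common approximation of the paper's fitted scaling laws)

def compute_optimal_split(flop_budget: float, tokens_per_param: float = 20.0):
    """Return (n_params, n_tokens) that roughly exhaust flop_budget."""
    # With D = tokens_per_param * N:  C = 6 * N * (tokens_per_param * N)
    # => N = sqrt(C / (6 * tokens_per_param)),  D = tokens_per_param * N
    n_params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # An assumed budget of ~5.8e23 FLOPs (roughly 6 * 70e9 * 1.4e12):
    n, d = compute_optimal_split(5.8e23)
    print(f"~{n / 1e9:.0f}B params, ~{d / 1e12:.1f}T tokens")  # -> ~70B params, ~1.4T tokens
```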

Chinchilla is a 70B-param model trained on more than 5x as much data as the 530B-param Megatron-Turing NLG, and it outperforms it.
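
For scale: feeding the sketch above a budget of ~5.8e23 FLOPs lands on roughly 70B params and 1.4T tokens, matching Chinchilla's actual configuration, while Megatron-Turing NLG was trained on roughly 270B tokens, hence the >5x gap in training data.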