Chinchilla
or: Training Compute-Optimal Large Language Models
For a given compute budget, training on more data improves current models more than increasing their param count; compute-optimally, model size and training tokens should be scaled in roughly equal proportion.
Chinchilla is a 70B param model trained on more than 5x as much data as Megatron-Turing NLG (530B params), and it outperforms it.
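A minimal sketch of what the compute-optimal split looks like, assuming the common approximations associated with the paper (training FLOPs C ≈ 6·N·D and roughly 20 training tokens per parameter); the function name and budget constant are illustrative, not from the paper:

```python
import math

def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a training FLOPs budget C into params (N) and tokens (D).

    Assumptions (rules of thumb, not the paper's exact fitted laws):
      - training FLOPs: C ~= 6 * N * D
      - compute-optimal ratio: D / N ~= 20 tokens per parameter
    Solving C = 6 * N * (tokens_per_param * N) gives N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla-scale budget, back-computed as 6 * 70e9 params * 1.4e12 tokens
n, d = compute_optimal_allocation(6 * 70e9 * 1.4e12)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")  # ~70B params, ~1.4T tokens
```

Under the same budget, putting the compute into params instead (a 530B-param model) would leave only ~185B tokens, which is the undertrained regime the paper argues against.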