RoBERTa: A Robustly Optimized BERT Pretraining Approach

The authors find that BERT was undertrained: pretraining on more data, for longer, and without the next sentence prediction objective is more effective. Given enough training, it reaches SOTA without yet saturating.
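
A minimal sketch of that setup, assuming the Hugging Face `transformers` library: masked language modeling only, with no next sentence prediction head, and masks re-sampled each batch rather than fixed at preprocessing time.

```python
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# RobertaForMaskedLM carries only the MLM head; there is no NSP classifier,
# unlike BertForPreTraining.
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

# The collator chooses masked positions anew every time a batch is built
# ("dynamic masking"), instead of fixing them once over the whole corpus.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```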

Later architectural innovations were apparently not meaningful contributions; BERT was simply hungry for more data and training.