or: Improving Language Understanding by Generative Pre-Training

A 12-layer decoder-only transformer is pretrained on unlabeled text to predict the next token from the preceding context. Adding a single fully connected layer on top of the learned representations turns it into a general-purpose backbone network for many NLP tasks. It also trains much faster than a comparable LSTM, since attention is computed in parallel across positions instead of step by step through a recurrence.
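
To make the two-stage idea concrete, here is a minimal sketch in PyTorch: a causal decoder-only language model pretrained on next-token prediction, then reused for a downstream task by bolting a single linear layer onto its representations. All hyperparameters, class names, and the last-position pooling choice are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class DecoderLM(nn.Module):
    """Causal decoder-only transformer with a next-token prediction head."""
    def __init__(self, vocab_size=10000, d_model=256, n_layers=12, n_heads=8, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # pretraining (language modeling) head

    def backbone(self, tokens):
        # Causal mask: each position may only attend to itself and earlier positions.
        seq_len = tokens.size(1)
        mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tokens.device), diagonal=1
        )
        pos = torch.arange(seq_len, device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)
        return self.blocks(h, mask=mask)

    def forward(self, tokens):
        # Logits over the vocabulary for unsupervised next-token pretraining.
        return self.lm_head(self.backbone(tokens))

class Classifier(nn.Module):
    """Downstream task head: one fully connected layer on top of the pretrained backbone."""
    def __init__(self, backbone, n_classes):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.lm_head.in_features, n_classes)

    def forward(self, tokens):
        h = self.backbone.backbone(tokens)  # reuse the learned representations
        return self.head(h[:, -1, :])       # classify from the final position (assumed pooling)

# Pretraining step: targets are the inputs shifted by one position.
lm = DecoderLM()
tokens = torch.randint(0, 10000, (2, 32))
logits = lm(tokens[:, :-1])
lm_loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)

# Fine-tuning step: same backbone, plus a single new linear layer for the task.
clf = Classifier(lm, n_classes=2)
task_logits = clf(tokens)
```

The point of the sketch is the asymmetry between the two stages: pretraining touches every parameter via the language-modeling loss, while adapting to a new task only requires the small, randomly initialized linear head on top.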