Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

A simple modification of the standard Transformer: segment-level recurrence with relative positional encodings.
Whenever the context window shifts to the next segment, the hidden states computed for the previous segment are cached (with gradients stopped) and reused as extra context, instead of being discarded (see the sketch below).
Simple, state-of-the-art on language modeling benchmarks at publication, and able to capture dependencies well beyond the training segment length, extrapolating to much longer contexts at evaluation time.
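
A minimal sketch of the segment-level recurrence idea, not the authors' implementation: cached hidden states from the previous segment are detached and concatenated with the current segment's hidden states to form the keys/values. The class and variable names (`SegmentRecurrentAttention`, `mem`, `hidden`) are illustrative, and the sketch omits the causal mask and the paper's relative positional encodings.

```python
import torch
import torch.nn as nn


class SegmentRecurrentAttention(nn.Module):
    """Self-attention over [cached memory; current segment]."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden, mem=None):
        # hidden: (batch, seg_len, d_model) for the current segment
        # mem:    (batch, mem_len, d_model) cached from the previous segment
        if mem is not None:
            # Stop gradients through the cache, then extend the context
            # that keys/values attend over.
            context = torch.cat([mem.detach(), hidden], dim=1)
        else:
            context = hidden
        out, _ = self.attn(hidden, context, context, need_weights=False)
        # Current hidden states become the memory for the next segment.
        new_mem = hidden.detach()
        return out, new_mem


if __name__ == "__main__":
    layer = SegmentRecurrentAttention(d_model=64, n_heads=4)
    mem = None
    for _ in range(3):  # iterate over consecutive segments
        seg = torch.randn(2, 16, 64)
        out, mem = layer(seg, mem)
    print(out.shape)  # torch.Size([2, 16, 64])
```

Because the memory is detached, training cost stays per-segment while the effective context grows with the number of cached segments; at evaluation time the memory length can simply be increased.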