or: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

One of many subquadratic attention variants. This one has a big speed advantage for autoregressive inference, at the expense of being more complicated and requiring a custom kernel. Not sure how much of the speed advantage is just from the custom CUDA implementation…
If I’m not mistaken, the key idea is to replace the softmax with a kernel feature map phi(·), so attention factors as phi(Q)(phi(K)^T V) instead of softmax(QK^T)V. For autoregressive decoding, the causal sums over phi(k_j) v_j^T and phi(k_j) can then be carried along as a fixed-size recurrent state and updated one token at a time, which is what makes the “Transformers are RNNs” framing work (contrast Transformer-XL, which instead caches raw activations for a sliding window of past segments). A minimal sketch of the recurrent update is below.
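
A minimal numpy sketch of that recurrent, inference-time view, assuming the elu(x)+1 feature map from the paper; the names here (`feature_map`, `S`, `z`, `step`) are my own shorthand, not the paper’s reference code:

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: the positive feature map used in the paper
    # (x + 1 for x > 0, exp(x) otherwise)
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def init_state(d_key, d_value):
    # S accumulates phi(k) v^T, z accumulates phi(k)
    return np.zeros((d_key, d_value)), np.zeros(d_key)

def step(state, q, k, v, eps=1e-6):
    # One autoregressive step: fold the new key/value into the running
    # sums, then read them out with the current query. Cost per token is
    # O(d_key * d_value), independent of how many tokens came before.
    S, z = state
    phi_k = feature_map(k)
    S = S + np.outer(phi_k, v)
    z = z + phi_k
    phi_q = feature_map(q)
    out = (phi_q @ S) / (phi_q @ z + eps)
    return (S, z), out

# Toy usage: decode 5 tokens of dimension 64, one at a time.
d = 64
state = init_state(d, d)
rng = np.random.default_rng(0)
for _ in range(5):
    q, k, v = rng.standard_normal((3, d))
    state, out = step(state, q, k, v)
```

Because the state (S, z) has fixed size, generation needs constant memory and constant time per token, which is where the autoregressive speedup comes from.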

Note: this is one of several transformer variants called “Linear Transformers”. To disambiguate, you might refer to it as Katharopoulos et al.’s version, or as “Transformers Are RNNs” linear attention.