or: Augmenting Self-attention with Persistent Memory

The authors introduce a transformer whose self-attention is augmented with fixed, learned "persistent" vectors that act as extra key-value pairs, and they remove the feedforward layers entirely, since the persistent vectors serve the same purpose. On a large adaptive-span model, this roughly halves the parameter count. They don't tout any other significant advantages, but they do report nice results with adaptive attention spans, with a ~40M-parameter model compressing enwik8 to 1.01 bpc.
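
To make the idea concrete, here is a minimal sketch (not the authors' code) of a self-attention layer whose key/value pool is extended with trainable persistent vectors, so that no separate feedforward sublayer is needed. The class and parameter names are hypothetical, and causal masking and the adaptive span are omitted for brevity.

```python
import torch
import torch.nn as nn

class PersistentMemoryAttention(nn.Module):
    """Sketch: multi-head self-attention with learned persistent key/value vectors
    concatenated to the context, standing in for the removed feedforward layer."""

    def __init__(self, dim, n_heads, n_persistent):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Persistent vectors: trainable, input-independent, shared across positions.
        self.pk = nn.Parameter(torch.randn(n_heads, n_persistent, self.head_dim) * 0.02)
        self.pv = nn.Parameter(torch.randn(n_heads, n_persistent, self.head_dim) * 0.02)

    def forward(self, x):
        B, T, D = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Append the persistent vectors to the context keys and values.
        k = torch.cat([k, self.pk.unsqueeze(0).expand(B, -1, -1, -1)], dim=2)
        v = torch.cat([v, self.pv.unsqueeze(0).expand(B, -1, -1, -1)], dim=2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out_proj(out)
```

Since attention over the fixed key/value pairs is just a data-dependent mixture of learned vectors, it plays the same functional role as the position-wise feedforward block, which is why dropping that block is possible in the first place.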