All-Attention
or: Augmenting Self-attention with Persistent Memory
The authors introduce a transformer variant in which each self-attention layer is augmented with learned, input-independent key and value vectors ("persistent memory") that act as weights. Because these fixed vectors can serve the same purpose as the feedforward layers, the feedforward layers are removed entirely. On a large adaptive-span model, this lets them halve the parameter count. They don’t tout any other significant advantages, but they do report nice results with adaptive attention spans: a 40M-parameter model reaches 1.01 bits per character (bpc) on enwik8.
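To make the mechanism concrete, here is a minimal single-head sketch in NumPy of attention over context tokens plus persistent vectors. It is an illustration under stated assumptions, not the authors' code: the names `all_attention_head`, `mem_k`, and `mem_v` are hypothetical, the scaling by the square root of the head dimension is assumed, and positional encodings, multiple heads, and adaptive span are all left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def all_attention_head(x, w_q, w_k, w_v, mem_k, mem_v):
    """One causal attention head with persistent memory (illustrative sketch).

    x:            (t, d_model) input sequence
    w_q/w_k/w_v:  (d_model, d_head) projection matrices
    mem_k/mem_v:  (n_mem, d_head) learned vectors that do NOT depend on x
    Returns the (t, d_head) output and the (t, t + n_mem) attention weights.
    """
    t = x.shape[0]
    n_mem = mem_k.shape[0]
    d_head = w_q.shape[1]

    q = x @ w_q
    # Keys and values are the context projections with the persistent
    # vectors appended, so one softmax covers both.
    k = np.concatenate([x @ w_k, mem_k], axis=0)  # (t + n_mem, d_head)
    v = np.concatenate([x @ w_v, mem_v], axis=0)  # (t + n_mem, d_head)

    scores = q @ k.T / np.sqrt(d_head)  # assumed scaling

    # Causal mask over context positions only; persistent vectors are
    # visible from every position.
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    mask = np.concatenate([future, np.zeros((t, n_mem), dtype=bool)], axis=1)
    scores = np.where(mask, -np.inf, scores)

    weights = softmax(scores, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t, d_model, d_head, n_mem = 5, 8, 4, 3
    x = rng.normal(size=(t, d_model))
    w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    mem_k, mem_v = (rng.normal(size=(n_mem, d_head)) for _ in range(2))
    out, weights = all_attention_head(x, w_q, w_k, w_v, mem_k, mem_v)
    print(out.shape, weights.shape)
```

The only difference from an ordinary attention head is the concatenation of `mem_k` and `mem_v`: the attention weights that land on those columns multiply fixed learned vectors, which is how they can stand in for a feedforward layer's weights.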