or: Thinking Like Transformers

This insightful paper develops RASP, a domain-specific language for representing computations achievable within the transformer architecture.
The full-attention mechanism can be used to perform aggregation tasks over its inputs, including sorting, histogramming, and averaging.
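To make this concrete, here is a minimal Python sketch of RASP-style primitives (the helper names and signatures are illustrative, not the paper's exact API): `select` builds a boolean attention pattern, `aggregate` averages values over the selected positions, and `selector_width` counts them. Together they give histogramming and averaging in a couple of lines.

```python
def select(keys, queries, predicate):
    # selection[i][j] is True when query position i attends to key position j
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selection, values, default=0.0):
    # Uniform attention over the selected positions: a simple mean.
    out = []
    for row in selection:
        picked = [v for chosen, v in zip(row, values) if chosen]
        out.append(sum(picked) / len(picked) if picked else default)
    return out

def selector_width(selection):
    # How many positions each query attends to.
    return [sum(row) for row in selection]

tokens = list("hello")

# Histogram: each position counts occurrences of its own token.
same_token = select(tokens, tokens, lambda k, q: k == q)
hist = selector_width(same_token)                      # [1, 1, 2, 2, 1]

# Averaging: attend everywhere and aggregate the values.
everywhere = select(tokens, tokens, lambda k, q: True)
mean_value = aggregate(everywhere, [1.0, 2.0, 3.0, 4.0, 5.0])  # [3.0] * 5
```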
Furthermore, a transformer can learn these behaviors through backprop. This view also explains the improved performance of the ‘sandwich transformer’ architecture, with attention layers at the front and feed-forward layers at the back: the network can learn a process that first collects and arranges the data of interest via attention, before any further local or elementwise computation is needed.
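The following self-contained sketch (purely illustrative, with a hypothetical `attention_sort` helper) shows that gather-then-compute pattern: an attention-like step routes each token to its sorted slot, and only afterwards does an elementwise step act on the rearranged values.

```python
def attention_sort(tokens):
    n = len(tokens)
    # Attention-like step: each position counts how many tokens should precede
    # it (ties broken by original index), giving its target slot.
    target = [sum(1 for j in range(n)
                  if tokens[j] < tokens[i] or (tokens[j] == tokens[i] and j < i))
              for i in range(n)]
    # Routing step: slot i reads the token whose target slot is i.
    gathered = [tokens[target.index(i)] for i in range(n)]
    # Elementwise step: local computation now sees the arranged data,
    # e.g. uppercasing each token independently.
    return [t.upper() for t in gathered]

print(attention_sort(list("hello")))  # ['E', 'H', 'L', 'L', 'O']
```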