or: Efficient Content-Based Sparse Attention with Routing Transformers

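Uses k-means clustering inside the attention mechanism to decide which elements each query sparsely attends to: queries and keys are assigned to clusters, and attention is computed only among members of the same cluster. Competitive with Transformer-XL.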
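
A minimal sketch of the routing idea, not the paper's implementation. It assumes the centroids are already given and fixed, and it omits the k-means centroid updates, causal masking, and any balancing of cluster sizes. The names `routing_attention` and `nearest_centroid` are hypothetical, chosen for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest_centroid(vec, centroids):
    """Index of the centroid with the largest dot product with vec.

    Assumption: similarity is measured by dot product; ties go to the
    lowest centroid index.
    """
    return max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))


def routing_attention(queries, keys, values, centroids):
    """Softmax attention restricted to keys in the query's own cluster."""
    d = len(queries[0])
    dim_v = len(values[0])
    q_cluster = [nearest_centroid(q, centroids) for q in queries]
    k_cluster = [nearest_centroid(k, centroids) for k in keys]

    outputs = []
    for q, qc in zip(queries, q_cluster):
        # Only keys routed to the same cluster as this query are visible.
        members = [j for j, kc in enumerate(k_cluster) if kc == qc]
        if not members:
            # Assumption: a query with no keys in its cluster outputs zeros.
            outputs.append([0.0] * dim_v)
            continue
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in members]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][t] for w, j in zip(weights, members)) / total
            for t in range(dim_v)
        ])
    return outputs, q_cluster, k_cluster


if __name__ == "__main__":
    centroids = [[1.0, 0.0], [0.0, 1.0]]
    queries = [[1.0, 0.0], [0.0, 1.0]]
    keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.1]]
    values = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
    out, qc, kc = routing_attention(queries, keys, values, centroids)
    print(qc, kc)
    print(out)
```

Each query scores only the keys in its own cluster rather than every key, which is where the sparsity comes from. How much work this saves depends on how evenly the clusters are populated.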