Routing Transformer
or: Efficient Content-Based Sparse Attention with Routing Transformers
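Uses k-means clustering inside the attention mechanism to decide which elements each position sparsely attends to: queries and keys are assigned to clusters, and attention is computed only among members of the same cluster. Competitive with Transformer-XL.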
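To make the routing idea concrete, here is a minimal pure-Python sketch, not the paper's implementation. It rests on several assumptions: the function names (`kmeans`, `nearest`, `routed_attention`) are hypothetical, it uses plain batch Lloyd's k-means with squared Euclidean distance in place of whatever centroid update the real model performs during training, and it covers a single head with no causal masking.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))


def kmeans(vectors, num_clusters, iters=10, seed=0):
    """Plain Lloyd's k-means (an assumption of this sketch); returns centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, num_clusters)]
    dim = len(vectors[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(num_clusters)]
        counts = [0] * num_clusters
        for v in vectors:
            c = nearest(v, centroids)
            counts[c] += 1
            for d in range(dim):
                sums[c][d] += v[d]
        for c in range(num_clusters):
            if counts[c]:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [s / counts[c] for s in sums[c]]
    return centroids


def routed_attention(queries, keys, values, centroids):
    """Each query attends only to the keys assigned to its own cluster."""
    dim = len(queries[0])
    out_dim = len(values[0])
    key_cluster = [nearest(k, centroids) for k in keys]
    outputs = []
    for q in queries:
        cq = nearest(q, centroids)
        idx = [j for j, ck in enumerate(key_cluster) if ck == cq]
        if not idx:  # no key shares this query's cluster: emit zeros
            outputs.append([0.0] * out_dim)
            continue
        # Scaled dot-product scores, restricted to the routed subset of keys.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in idx]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out = [0.0] * out_dim
        for w, j in zip(weights, idx):
            for d in range(out_dim):
                out[d] += (w / total) * values[j][d]
        outputs.append(out)
    return outputs
```

In use, one would fit centroids once with something like `kmeans(queries + keys, num_clusters)` and pass them to `routed_attention`. The saving comes from each query scoring only the keys in its own cluster rather than all of them, so the per-query cost scales with cluster size instead of sequence length.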
from a laptop in Sunnyvale