or: Big Bird: Transformers for Longer Sequences

BigBird uses a mixture of three attention types: local, global, and random.
The split is per token rather than per head: a small set of designated global tokens attend to, and are attended by, every position in the sequence; every token attends to a sliding window of neighboring tokens; and each token also attends to a handful of randomly selected tokens. Together these keep the cost of attention linear in sequence length instead of quadratic, which is what makes much longer inputs feasible.
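To make the pattern concrete, here is a minimal sketch of how the three components can be combined into a single boolean attention mask. It works token by token for clarity, whereas the actual model operates on blocks of tokens for efficiency; the function name and the window/global/random sizes here are illustrative assumptions, not the paper's values.

```python
# Minimal token-level sketch of a BigBird-style sparse attention mask.
# Illustrative only; the real implementation is block-sparse.
import numpy as np

def bigbird_mask(seq_len: int, window: int = 3, n_global: int = 2,
                 n_random: int = 2, seed: int = 0) -> np.ndarray:
    """Return a (seq_len, seq_len) bool mask; True = query i may attend to key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Local: each token attends to a sliding window of neighbors.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Global: the first n_global tokens attend everywhere and are
    # attended to by every token.
    mask[:n_global, :] = True
    mask[:, :n_global] = True

    # Random: each token additionally attends to a few random keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=n_random, replace=False)] = True

    return mask

mask = bigbird_mask(seq_len=16)
print(mask.sum(axis=1))  # per-query key counts stay roughly constant, not O(n)
```

Note how the number of keys each query attends to stays roughly fixed as the sequence grows, which is where the linear scaling comes from.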

This lets BigBird handle sequences up to 8x longer than comparable full-attention models on similar hardware, and it achieves state-of-the-art results on long-document NLP tasks such as question answering and summarization.