Adaptive Attention Span in Transformers

This paper lets each self-attention head learn its own attention span (context length), up to 8k tokens. Shallow layers converge to short spans, while deeper layers learn a mix of short and long spans.
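
A minimal sketch of the idea, assuming the paper's piecewise-linear soft mask m_z(x) = min(max((R + z - x) / R, 0), 1), where z is a learnable span per head and R is the width of the soft ramp; the module and parameter names here are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Soft mask over key distances; the span z is learned separately per head."""
    def __init__(self, num_heads, max_span=8192, ramp=32, init_frac=0.0):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp
        # one learnable span per head, parameterized as a fraction of max_span
        self.span_frac = nn.Parameter(torch.full((num_heads, 1, 1), init_frac))

    def forward(self, attn_scores):
        # attn_scores: (batch, heads, query_len, key_len), keys ordered oldest -> newest.
        # Simplification: distance is measured from the newest key, which is exact
        # for the last query position (e.g. single-step decoding).
        key_len = attn_scores.size(-1)
        dist = torch.arange(key_len - 1, -1, -1,
                            device=attn_scores.device, dtype=attn_scores.dtype)
        z = self.span_frac.clamp(0, 1) * self.max_span      # (heads, 1, 1)
        # piecewise-linear mask: 1 inside the span, linear ramp of width `ramp`, 0 beyond
        mask = ((self.ramp + z - dist) / self.ramp).clamp(0, 1)  # (heads, 1, key_len)
        weights = torch.softmax(attn_scores, dim=-1) * mask
        # renormalize so each query's attention weights still sum to 1
        return weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-8)
```

The paper also adds an L1 penalty on the span parameters, which is what pushes heads (especially in shallow layers) toward short spans unless a longer context actually helps the loss.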