Adaptive-Span
or: Adaptive Attention Span in Transformers
This paper lets each self-attention head learn its own context length (span), up to 8k tokens, instead of attending over a fixed window. Shallow layers learn only short spans, while deeper layers learn a mix of short and long spans.
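The learned span works through a soft masking ramp: attention weights at distance up to z pass through untouched, weights beyond z + R are zeroed, and the ramp is linear in between, which keeps the span parameter z differentiable. A minimal NumPy sketch of that idea (variable names and the toy values here are mine, not from the paper):

```python
import numpy as np

def soft_mask(distances, z, R=32):
    # Piecewise-linear ramp: 1 for distances <= z,
    # 0 for distances >= z + R, linear in between.
    # z is learned per head; R is a ramp-width hyperparameter.
    return np.clip((R + z - distances) / R, 0.0, 1.0)

# Toy example: one head attending over the last 8 positions.
scores = np.random.randn(8)        # raw attention logits
dist = np.arange(8, 0, -1.0)       # distance of each key from the query
z = 4.0                            # learned span for this head

# Mask the exponentiated scores, then renormalize (masked softmax).
weights = np.exp(scores) * soft_mask(dist, z, R=2)
weights /= weights.sum()
```

Because the ramp is differentiable in z, each head can shrink or grow its span by gradient descent, and an L1 penalty on z pushes heads toward the shortest span they can get away with.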
from a laptop in Sunnyvale