Lite Transformer
Lite Transformer With Long-Short Range Attention
Effective tiny transformer models can be created by splitting attention into two specialized branches: convolution handles the short-range context, while a smaller number of heads capture long-range context with full self-attention.
from a laptop in Sunnyvale
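To make the split concrete, here is a minimal PyTorch-style sketch of the idea, not the paper's exact implementation: the module name `LSRABlock` is hypothetical, and a depthwise 1D convolution stands in for the paper's lightweight convolution. Half of the channels go through the convolutional (short-range) branch and half through a small self-attention (long-range) branch, then the two are concatenated.

```python
import torch
import torch.nn as nn

class LSRABlock(nn.Module):
    """Hypothetical sketch of Long-Short Range Attention (LSRA).

    Channels are split in half: one branch models short-range context
    with a depthwise 1D convolution (approximating the paper's
    lightweight convolution), the other models long-range context with
    ordinary multi-head self-attention using few heads.
    """

    def __init__(self, embed_dim: int, num_heads: int = 2, kernel_size: int = 3):
        super().__init__()
        assert embed_dim % 2 == 0, "embed_dim must split evenly into two branches"
        half = embed_dim // 2
        # Short-range branch: depthwise convolution over the sequence axis.
        self.conv = nn.Conv1d(
            half, half, kernel_size, padding=kernel_size // 2, groups=half
        )
        # Long-range branch: full self-attention with a small head count.
        self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        local, glob = x.chunk(2, dim=-1)
        # Conv1d expects (batch, channels, seq_len).
        local = self.conv(local.transpose(1, 2)).transpose(1, 2)
        glob, _ = self.attn(glob, glob, glob, need_weights=False)
        return torch.cat([local, glob], dim=-1)

# Usage: a toy batch of 4 sequences, length 16, model width 64.
x = torch.randn(4, 16, 64)
out = LSRABlock(embed_dim=64)(x)
print(out.shape)  # torch.Size([4, 16, 64])
```

Because the convolution only ever sees half the width and attention runs on the other half with fewer heads, the block is cheaper than full-width self-attention while still covering both local and global context.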