Lite Transformer With Long-Short Range Attention

Effective tiny transformer models can be built by splitting attention into two specialized branches: convolution handles short-range (local) context, while a smaller number of heads with full self-attention capture long-range context.
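
Below is a minimal sketch of such a long-short range attention block, assuming a PyTorch setting. The channel split ratio, kernel size, and the `LSRABlock` name are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class LSRABlock(nn.Module):
    """Splits the embedding channels into two branches:
    one half models long-range context with self-attention,
    the other half models local context with a depthwise convolution."""

    def __init__(self, embed_dim: int, num_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        assert embed_dim % 2 == 0, "embed_dim must be even to split in half"
        half = embed_dim // 2
        # Long-range branch: full multi-head self-attention on half the channels.
        self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        # Short-range branch: depthwise convolution along the sequence dimension.
        self.conv = nn.Conv1d(half, half, kernel_size,
                              padding=kernel_size // 2, groups=half)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        x_attn, x_conv = x.chunk(2, dim=-1)
        # Global context via self-attention.
        attn_out, _ = self.attn(x_attn, x_attn, x_attn)
        # Local context via convolution, which expects (batch, channels, seq_len).
        conv_out = self.conv(x_conv.transpose(1, 2)).transpose(1, 2)
        # Merge the two branches back into the full embedding dimension.
        return self.out(torch.cat([attn_out, conv_out], dim=-1))


if __name__ == "__main__":
    block = LSRABlock(embed_dim=64, num_heads=4)
    tokens = torch.randn(2, 10, 64)   # (batch, seq_len, embed_dim)
    print(block(tokens).shape)        # torch.Size([2, 10, 64])
```

Because each branch operates on only half the channels, the attention branch needs fewer heads and less compute than a full-width transformer layer, which is where the efficiency gain comes from.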