Primer
Primer: Searching for Efficient Transformers for Language Modeling
Searching over small tweaks to the Transformer for large language modeling turns up two improvements:
- use squared ReLU activation in the feedforward layers
- apply a depthwise convolution after projecting the self-attention heads' Q, K, and V matrices.
These changes let the modified network converge several times faster than the unmodified Transformer; a sketch of both tweaks follows.
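A minimal PyTorch sketch of the two modifications, assuming a standard multi-head attention setup. The module names (SquaredReLU, FeedForward, DConvHeadProjection) and the kernel size of 3 are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SquaredReLU(nn.Module):
    """Squared ReLU: relu(x) ** 2, swapped in for ReLU/GELU in the feedforward block."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.relu(x) ** 2


class FeedForward(nn.Module):
    """Standard Transformer feedforward block using the squared ReLU activation."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            SquaredReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DConvHeadProjection(nn.Module):
    """Projects the input to Q, K, or V, then applies a causal depthwise
    convolution over the sequence dimension before splitting into heads."""

    def __init__(self, d_model: int, n_heads: int, kernel_size: int = 3):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.proj = nn.Linear(d_model, d_model)
        # Depthwise: groups == channels, so each channel gets its own filter.
        self.dconv = nn.Conv1d(
            d_model, d_model, kernel_size,
            groups=d_model, padding=kernel_size - 1,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        h = self.proj(x)            # dense projection first
        h = h.transpose(1, 2)       # (batch, d_model, seq_len) for Conv1d
        h = self.dconv(h)[..., :t]  # trim right padding so the conv stays causal
        h = h.transpose(1, 2)       # back to (batch, seq_len, d_model)
        # Split into heads: (batch, n_heads, seq_len, head_dim)
        return h.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
```

In use, one DConvHeadProjection each for Q, K, and V would replace the plain linear projections inside an otherwise unchanged attention block, and FeedForward would replace the usual MLP.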