Primer: Searching for Efficient Transformers for Language Modeling

Searching over modifications to the Transformer architecture for large language models turns up two simple improvements:

  • use a squared ReLU activation in the feed-forward layers
  • apply a depthwise convolution after projecting the self-attention heads’ Q, K, and V matrices

Together, these changes let the resulting model (Primer) converge several times faster than an unmodified Transformer; a minimal sketch of both modifications follows.
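
The sketch below is not the authors' code; it is an illustrative PyTorch rendering of the two modules under assumed shapes and hyperparameters (a kernel size of 3 for the depthwise convolution, and a `(batch, seq, channels)` layout). The convolution is left-padded so it stays causal for language modeling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SquaredReLUFFN(nn.Module):
    """Feed-forward block using the squared-ReLU activation: relu(x) ** 2."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_ff)
        self.out_proj = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.out_proj(F.relu(self.in_proj(x)) ** 2)


class DepthwiseConvHeads(nn.Module):
    """Causal depthwise convolution over the sequence axis, applied per
    channel to a projected Q, K, or V tensor of shape (batch, seq, channels)."""

    def __init__(self, n_heads: int, head_dim: int, kernel_size: int = 3):
        super().__init__()
        channels = n_heads * head_dim
        self.kernel_size = kernel_size
        # groups=channels makes the convolution depthwise: one filter per channel.
        self.conv = nn.Conv1d(channels, channels, kernel_size, groups=channels)

    def forward(self, x):
        x = x.transpose(1, 2)                    # (batch, channels, seq)
        x = F.pad(x, (self.kernel_size - 1, 0))  # left-pad so the conv is causal
        x = self.conv(x)
        return x.transpose(1, 2)                 # back to (batch, seq, channels)
```

In an attention layer, the depthwise convolution would sit between the Q/K/V linear projections and the attention-score computation, with one `DepthwiseConvHeads` instance per projection.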