or: Query-Key Normalization for Transformers

This paper suggests that a training issue with transformer attention heads is that the softmax can saturate: unbounded query-key dot products push the softmax into regions with vanishingly small gradients.
They L2-normalize the query and key vectors (the projected activations, not the weights) and replace the scaled dot product with cosine similarity multiplied by a learned scalar.
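A minimal sketch of the idea in PyTorch (single head, batch-first; the class name `QKNormAttention` and the initialization `g_init` are mine, not from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Sketch of attention with L2-normalized queries/keys and a learned scale."""

    def __init__(self, d_model: int, g_init: float = 10.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learnable scalar that replaces the fixed 1/sqrt(d_k) scaling;
        # initialized to a constant here for simplicity.
        self.g = nn.Parameter(torch.tensor(g_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q = F.normalize(self.q_proj(x), dim=-1)  # unit-norm queries
        k = F.normalize(self.k_proj(x), dim=-1)  # unit-norm keys
        v = self.v_proj(x)
        # Cosine similarity (bounded in [-1, 1]) scaled by the learned g,
        # instead of an unbounded dot product.
        scores = self.g * torch.matmul(q, k.transpose(-2, -1))
        attn = torch.softmax(scores, dim=-1)
        return torch.matmul(attn, v)
```

Because cosine similarity is bounded, the attention logits cannot grow arbitrarily large, and the learned scale controls how sharp the softmax is allowed to become.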