or: On the Sub-layer Functionalities of Transformer Decoder

The authors experiment with different decoder sublayers on NMT tasks and find that the decoder's feedforward sublayers are dead weight: they can be dropped with little loss in translation quality, at least for their 65M-parameter Transformer-base model.
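For a rough sense of scale, here is a back-of-the-envelope count of the weights in those decoder feedforward sublayers. It is a sketch, not the authors' accounting: it assumes the standard Transformer-base hyperparameters (d_model = 512, d_ff = 2048, 6 decoder layers), ignores biases and layer norms, and takes the 65M total from the note above. The function and variable names are made up for illustration.

```python
# Assumed Transformer-base hyperparameters (not stated in the note itself).
D_MODEL = 512
D_FF = 2048
N_DECODER_LAYERS = 6
TOTAL_PARAMS = 65_000_000  # approximate model size quoted above


def feedforward_weights(d_model: int, d_ff: int) -> int:
    """Weights in one feedforward sublayer: two projections, biases ignored."""
    return 2 * d_model * d_ff


def attention_weights(d_model: int) -> int:
    """Weights in one attention sublayer: Q, K, V and output projections."""
    return 4 * d_model * d_model


ffn_per_layer = feedforward_weights(D_MODEL, D_FF)
attn_per_layer = 2 * attention_weights(D_MODEL)  # self-attention + cross-attention

# Weights removed if every decoder feedforward sublayer is dropped.
removed = N_DECODER_LAYERS * ffn_per_layer
share_of_model = removed / TOTAL_PARAMS

print(f"feedforward weights per decoder layer: {ffn_per_layer:,}")
print(f"attention weights per decoder layer:   {attn_per_layer:,}")
print(f"weights removed across the decoder:    {removed:,}")
print(f"share of the whole model:              {share_of_model:.1%}")
```

Under these assumptions the feedforward sublayer holds about half of each decoder layer's weights, and dropping it from all six layers removes a bit under a fifth of the model.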

This contrasts with T5-11B, which packs parameters into wide, dense feedforward layers for better TPU throughput.