The Evolved Transformer

Neural architecture search is used to find an improved Transformer layer. The resulting model converges much faster than the vanilla Transformer and reaches approximately state-of-the-art quality when trained on NMT benchmarks.

The improved encoder layer adds a Gated Linear Unit (GLU) and a depthwise-separable convolution.
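The two components can be sketched in a few lines of NumPy. This is a minimal illustration of what a GLU and a 1-D depthwise-separable convolution compute, not the paper's implementation; the shapes and function names are my own.

```python
import numpy as np

def glu(x):
    """Gated Linear Unit: split the channels in half and gate one half
    with the sigmoid of the other, i.e. out = a * sigmoid(b)."""
    a, b = np.split(x, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

def depthwise_separable_conv1d(x, depthwise_k, pointwise_w):
    """Depthwise-separable 1-D convolution (illustrative, 'same' padding).

    x:           (time, channels)   input sequence
    depthwise_k: (kernel, channels) one filter per input channel
    pointwise_w: (channels, out)    1x1 conv that mixes channels
    """
    k = depthwise_k.shape[0]          # odd kernel size assumed
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Depthwise step: each channel is convolved with its own filter only.
    dw = np.stack(
        [np.convolve(xp[:, c], depthwise_k[::-1, c], mode="valid")
         for c in range(x.shape[1])],
        axis=-1,
    )
    # Pointwise step: a 1x1 convolution is just a matrix multiply.
    return dw @ pointwise_w

# GLU halves the channel dimension: a=[1,2], b=[0,0] -> [0.5, 1.0]
print(glu(np.array([1.0, 2.0, 0.0, 0.0])))

# Separable conv keeps the time dimension with 'same' padding.
x = np.ones((5, 2))
out = depthwise_separable_conv1d(x, np.ones((3, 2)), np.eye(2))
print(out.shape)
```

The depthwise + pointwise split is what makes the convolution cheap: it uses roughly `k*c + c*out` weights instead of the `k*c*out` of a full convolution.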