Evolved Transformer
Neural architecture search is used to find an improved Transformer layer. The resulting architecture converges much faster than the vanilla Transformer and reaches approximately state-of-the-art results when trained on neural machine translation (NMT).
The improved encoder layer adds a Gated Linear Unit (GLU) and a depthwise separable convolution.
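
To make the two building blocks concrete, here is a minimal pure-Python sketch of each. It is an illustration under stated assumptions, not the searched architecture itself: the function names, the list-of-lists tensor layout, the zero "same" padding, and the split-in-half form of the gate are all choices made for this example, and the real layer's sizes and wiring are not reproduced.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def glu(x):
    """Gated Linear Unit (assumed split-in-half form).

    The first half of x carries values, the second half carries gates:
    out[i] = x[i] * sigmoid(x[i + half]).
    """
    half = len(x) // 2
    values, gates = x[:half], x[half:]
    return [v * sigmoid(g) for v, g in zip(values, gates)]


def depthwise_separable_conv1d(seq, depthwise, pointwise):
    """1D depthwise separable convolution over a sequence.

    seq:       [time][channels] input
    depthwise: [channels][kernel] one filter per input channel
    pointwise: [out_channels][channels] 1x1 convolution that mixes channels
    Zero padding keeps the output length equal to the input length
    (assumes an odd kernel size).
    """
    steps, channels = len(seq), len(seq[0])
    kernel = len(depthwise[0])
    pad = kernel // 2

    # Depthwise step: each channel is filtered along time on its own.
    filtered = [[0.0] * channels for _ in range(steps)]
    for t in range(steps):
        for c in range(channels):
            acc = 0.0
            for k in range(kernel):
                s = t + k - pad
                if 0 <= s < steps:
                    acc += depthwise[c][k] * seq[s][c]
            filtered[t][c] = acc

    # Pointwise step: mix channels independently at every time step.
    return [
        [sum(w * v for w, v in zip(row, step)) for row in pointwise]
        for step in filtered
    ]
```

The point of the factorization is parameter count: with C input channels, C_out output channels and kernel width K, the depthwise plus pointwise pair uses C*K + C*C_out weights, whereas a standard convolution uses C*C_out*K.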