or: Depthwise Separable Convolutions for Neural Machine Translation

This paper advocates using depthwise-separable convolutions, in which the spatial (depthwise) convolution doesn't mix information across channels. If there's a 3x3 convolution with 300 channels, it's dramatically more efficient to handle this as 300 separate 3x3 convolutions, one per channel. Afterwards, you mix the channels with a 1x1 (pointwise) convolution.
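To make the savings concrete: a full 3x3 convolution with 300 input and 300 output channels costs 3·3·300·300 = 810,000 parameters, while the separable version costs 3·3·300 + 300·300 = 92,700, roughly 8.7x fewer. A minimal NumPy sketch of the two-stage structure (function name, loop-based implementation, and channels-last layout are my own, not from the paper):

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """x: (H, W, C) input; depthwise_k: (3, 3, C), one 3x3 filter per channel;
    pointwise_w: (C, C_out), the 1x1 convolution that mixes channels."""
    H, W, C = x.shape
    pad = np.pad(x, ((1, 1), (1, 1), (0, 0)))  # zero-pad spatially
    dw = np.zeros((H, W, C))
    # Depthwise stage: each channel is convolved with its own filter only;
    # no information crosses channels here.
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + 3, j:j + 3, :]            # (3, 3, C)
            dw[i, j] = np.sum(patch * depthwise_k, axis=(0, 1))
    # Pointwise stage: a 1x1 convolution is just a matrix multiply per pixel,
    # and this is the only place channels mix.
    return dw @ pointwise_w
```

Real implementations fuse these stages into framework primitives (e.g. grouped convolutions), but the factorization into "spatial per channel, then 1x1 across channels" is the whole idea.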

They also introduce super-separable convolutions, which split the channels into groups and restrict the 1x1 mixing to channels within the same group after the separate spatial convolutions.
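Restricting the mixing to g groups cuts the pointwise cost from C² to C²/g (e.g. 90,000 → 45,000 parameters for C = 300, g = 2). A sketch of just the grouped mixing stage, reusing the depthwise output from above (names and shapes are my own assumptions):

```python
import numpy as np

def grouped_pointwise(dw, group_w):
    """dw: (H, W, C) output of the per-channel spatial convolutions;
    group_w: (g, C//g, C//g), one small mixing matrix per channel group.
    Channels only mix within their own group."""
    g, s, _ = group_w.shape
    out = np.empty_like(dw)
    for k in range(g):
        # Each group of s channels gets its own 1x1 convolution.
        out[..., k * s:(k + 1) * s] = dw[..., k * s:(k + 1) * s] @ group_w[k]
    return out
```

Stacking several such layers with different groupings lets information eventually reach all channels while keeping each layer cheap.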

This pattern, similar to MobileNets, is then applied to machine translation along with inner-product attention. It's efficient, but not as effective as Transformers.