Introduces the Transformer architecture in a clear and insightful way. Important claim:

  • Dot-product self-attention is computationally cheaper per layer than a recurrent layer when the sequence length n is smaller than the representation dimension d: a self-attention layer costs O(n^2 * d), while a recurrent layer costs O(n * d^2).

This makes self-attention a good choice for handling:

  • byte-pair encoded sentence representations in machine translation, where typical sentence lengths (a few dozen tokens) are well below the model dimension (d_model = 512 in the base Transformer).
  • deep ConvNet encodings, such as the 7x7x512 feature maps in VGGNet and ResNet (n = 49 spatial positions, d = 512 channels). A quick numeric check of both regimes follows this list.

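The paper's per-layer figures make this concrete: self-attention costs O(n^2 * d) while a recurrent layer costs O(n * d^2), so self-attention wins exactly when n < d. Below is a minimal back-of-the-envelope sketch in Python; the 50-token sentence length is an illustrative assumption, not a number from the paper.

    # Per-layer cost comparison (Table 1 of the paper):
    #   self-attention: O(n^2 * d)    recurrent layer: O(n * d^2)
    # Self-attention is cheaper whenever n < d.

    def self_attention_ops(n, d):
        return n * n * d      # every position attends to every position, d-dim dot products

    def recurrent_ops(n, d):
        return n * d * d      # one d x d matrix multiply per time step

    regimes = {
        # (n, d); the 50-token sentence length is an illustrative guess
        "BPE sentence, base Transformer (d_model = 512)": (50, 512),
        "7x7x512 ConvNet feature map": (49, 512),
    }

    for name, (n, d) in regimes.items():
        sa, rnn = self_attention_ops(n, d), recurrent_ops(n, d)
        print(f"{name}: self-attention ~{sa:.1e} ops vs recurrent ~{rnn:.1e} ops "
              f"(n/d = {n / d:.2f})")
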
For NMT, the Transformer outperforms ConvS2S, a convolutional sequence-to-sequence model that also relies on dot-product attention. It also beats ByteNet, a dilated convolutional model whose decoder expands dynamically with the target length (dynamic unfolding).
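
To make the term concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention; the head size of 64, the random inputs, and the omission of masking, multi-head splitting, and the output projection are all simplifications for illustration. The n x n score matrix is exactly where the O(n^2 * d) cost above comes from.

    import numpy as np

    def scaled_dot_product_self_attention(X, Wq, Wk, Wv):
        """Single-head self-attention over a sequence X of shape (n, d_model).

        Illustrative sketch only: no masking, no multi-head split, no output
        projection, unlike the full Transformer layer.
        """
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # (n, d_k), (n, d_k), (n, d_v)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                   # (n, n): every position vs. every other
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
        return weights @ V                                # (n, d_v)

    # Toy example: n = 49 positions (a 7x7 feature map) with d = 512 channels.
    rng = np.random.default_rng(0)
    n, d = 49, 512
    X = rng.standard_normal((n, d))
    Wq = rng.standard_normal((d, 64)) / np.sqrt(d)
    Wk = rng.standard_normal((d, 64)) / np.sqrt(d)
    Wv = rng.standard_normal((d, 64)) / np.sqrt(d)
    out = scaled_dot_product_self_attention(X, Wq, Wk, Wv)
    print(out.shape)   # (49, 64)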