Attention Is All You Need
Introduces the Transformer architecture in a clear and insightful way. Important claim:
- Dot-product self-attention is more efficient per layer than an RNN when the sequence length n is smaller than the representation dimension d: a self-attention layer does on the order of n^2 * d work, versus n * d^2 for a recurrent layer.
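A quick back-of-the-envelope check of this claim (a rough sketch; the specific (n, d) values are illustrative assumptions, only the n^2 * d vs n * d^2 scaling comes from the paper's complexity table):

```python
def per_layer_ops(n: int, d: int) -> tuple[int, int]:
    """Rough per-layer operation counts, constants ignored."""
    attn = n * n * d  # self-attention: n x n pairwise dot products of d-dim vectors
    rnn = n * d * d   # recurrence: n sequential steps, each roughly a d x d matrix-vector product
    return attn, rnn

# Illustrative settings: a short NMT sentence, the 7x7x512 convnet case, and a very long sequence.
for n, d in [(50, 512), (49, 512), (5000, 512)]:
    attn, rnn = per_layer_ops(n, d)
    winner = "attention" if attn < rnn else "RNN"
    print(f"n={n:5d}, d={d}: attention ~{attn:,} ops, RNN ~{rnn:,} ops -> {winner} cheaper")
```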
This makes self-attention a good choice for handling:
- byte-pair encoded sentences in NMT: d = 512-1024 in the paper's models, while sentences typically run to a few dozen tokens, so n < d (a BPE vocabulary size such as 65536 is a property of the tokenizer, not of the representation dimension d).
- deep convnet encodings, e.g. the final 7x7x512 feature maps of VGGNet and ResNet: d = 512, n = 7*7 = 49.
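For reference, a minimal single-head sketch of the scaled dot-product self-attention being discussed (NumPy; the single head, random weights, and shapes are my simplifications, the softmax(QK^T / sqrt(d_k)) V form is the paper's):

```python
import numpy as np

def self_attention(x: np.ndarray, wq: np.ndarray, wk: np.ndarray, wv: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over an input x of shape (n, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv          # project inputs to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (n, n) pairwise dot products -> the n^2 * d term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                        # (n, d) weighted sum of value vectors

# Example shapes matching the convnet case above: n = 49 positions, d = 512 channels.
n, d = 49, 512
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)  # (49, 512)
```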
On WMT 2014 translation it outperforms ConvS2S, a convolutional sequence-to-sequence model that also uses dot-product attention, and ByteNet, a dilated-convolution model that expands its decoder dynamically with the target length.