or: Neural Machine Translation of Rare Words with Subword Units

Introduces Byte Pair Encoding (BPE) for tokenization. Begin with a character-level vocabulary, then repeat this procedure:

  • Count the frequencies of all adjacent token pairs (bigrams).
  • Merge the most frequent bigram into a single new token.

Keep merging as long as the vocabulary has fewer than 256^2 tokens.
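
A minimal sketch of this training loop, assuming a whitespace-pre-tokenized corpus; the example corpus, vocabulary cap, and function names are illustrative, not from the paper.

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs (bigrams) across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_word(word, pair):
    """Merge every adjacent occurrence of `pair` in one space-separated symbol string."""
    symbols, out, i = word.split(), [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return " ".join(out)

def train_bpe(word_counts, max_vocab=256 ** 2):
    """Learn BPE merges until the symbol vocabulary reaches max_vocab (or no pairs remain)."""
    # start from character-level symbols, separated by spaces
    corpus = {" ".join(w): f for w, f in word_counts.items()}
    merges = []
    while True:
        vocab = {s for word in corpus for s in word.split()}
        if len(vocab) >= max_vocab:
            break
        pairs = pair_counts(corpus)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]  # most frequent bigram
        corpus = {merge_word(w, best): f for w, f in corpus.items()}
        merges.append(best)
    return merges

if __name__ == "__main__":
    words = Counter(["low", "low", "lower", "newest", "newest", "widest"])
    print(train_bpe(words, max_vocab=30))
```

The learned merge list can then be replayed, in order, to tokenize new text with the same vocabulary.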