Byte Pair Encoding
or: Neural Machine Translation of Rare Words with Subword Units
Introduces Byte Pair Encoding (BPE) for tokenization. Begin with a character-level encoding, then repeat this procedure:
- Collect statistics of all bigrams (pairs of adjacent tokens).
- Merge the most frequent bigram into a single new token.
Keep going as long as the vocabulary has fewer than 256^2 (= 65,536) tokens.
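
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names learn_bpe and max_vocab_size are made up for this note, and the sketch runs on one raw string and ignores word boundaries, whereas the paper works on a word-frequency dictionary and does not merge across words.

```python
from collections import Counter


def learn_bpe(text, max_vocab_size):
    """Learn BPE merges on one string, starting from single characters.

    Hypothetical helper for illustration; returns the final token
    sequence and the list of merges in the order they were learned.
    """
    tokens = list(text)        # character-level starting point
    vocab = set(tokens)
    merges = []

    while len(vocab) < max_vocab_size:
        # Collect statistics of all bigrams (adjacent token pairs).
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break              # no pair repeats, so merging gains nothing

        # Merge every non-overlapping occurrence of the most frequent pair.
        merged = left + right
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
        vocab.add(merged)
        merges.append((left, right))

    return tokens, merges


if __name__ == "__main__":
    # Real use would pass max_vocab_size=256 ** 2; a tiny cap keeps the demo readable.
    tokens, merges = learn_bpe("aaabdaaabac", max_vocab_size=7)
    print(tokens)
    print(merges)
```

On a short string the loop usually stops early because no pair occurs twice, long before the vocabulary cap is reached. The cap only matters on a real corpus.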