FastText Advances
or: Advances in Pre-Training Distributed Word Representations
This paper is essentially a release note for fastText word vectors trained on several large datasets, documenting the training tricks behind them.
The advances are:
- using word2vec's CBOW-style objective: predicting a word from its surrounding context, rather than predicting the context words from each word as in skip-gram (a minimal sketch of the forward pass follows the list)
- a way of creating phrase tokens by iteratively merging word bigrams with high mutual information. That seems like a good general recipe for building text tokens: run the same iterative merging over character bigrams and it is similar in spirit to BPE, which merges the most frequent pair rather than the highest-mutual-information one (see the sketch after the CBOW example).
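
To make the first point concrete, here is a minimal toy sketch of a CBOW forward pass in NumPy: average the context vectors, score the result against every vocabulary word, and take the cross-entropy of the true center word. The sizes, ids, and the plain softmax are all illustrative assumptions, not fastText's implementation, which adds negative sampling, position-dependent context weights, and subword n-grams.

```python
import numpy as np

# Toy CBOW step: predict the center word from the average of its
# context vectors. All dimensions and ids below are made up for
# illustration.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))   # context embeddings
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))  # output embeddings

context_ids = [2, 3, 5, 6]  # word ids around the center word
center_id = 4

h = W_in[context_ids].mean(axis=0)    # average context representation
scores = W_out @ h                    # score every vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                  # softmax over the vocabulary
loss = -np.log(probs[center_id])      # CBOW loss for the center word
```

Skip-gram inverts this: the center word's vector is scored against each context word separately, which is exactly the "context words based on each word" direction the paper moves away from.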
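
And a sketch of one pass of the bigram-merging idea on words. The function name `merge_pass`, the `threshold` and `min_count` parameters, and the use of textbook PMI are my assumptions for illustration; the original word2vec phrase tool uses a discounted count-based score instead.

```python
import math
from collections import Counter

def merge_pass(tokens, threshold=1.0, min_count=3):
    """One pass of phrase building: merge adjacent token pairs whose
    pointwise mutual information exceeds a threshold. Illustrative
    sketch only."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def pmi(a, b):
        p_ab = bigrams[(a, b)] / (total - 1)
        return math.log(p_ab / ((unigrams[a] / total) * (unigrams[b] / total)))

    out, i = [], 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and bigrams[(tokens[i], tokens[i + 1])] >= min_count
                and pmi(tokens[i], tokens[i + 1]) > threshold):
            out.append(tokens[i] + "_" + tokens[i + 1])  # merge into one token
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = "i love new york , new york city is big , new york is fun".split()
print(merge_pass(tokens))
# ['i', 'love', 'new_york', ',', 'new_york', 'city', 'is', 'big',
#  ',', 'new_york', 'is', 'fun']
```

Running the pass repeatedly lets already-merged tokens merge again ("new_york" + "city" given enough data), and that iterative flavor is what makes the character-level version resemble BPE.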