or: Advances in Pre-Training Distributed Word Representations

This is essentially a release note for the fastText word vectors trained on several large datasets.
The advances are:

  • using a word2vec CBOW-style model that predicts a word from its context, rather than a skip-gram model that predicts the context words from each word (first sketch below)
  • a way of creating richer tokens by iteratively merging word bigrams with high mutual information into phrases. That also seems like a promising way to create subword tokens, by merging character bigrams iteratively (second sketch below).
    This is similar to BPE, which also builds tokens by merging pairs iteratively, though BPE merges the most frequent pair rather than the pair with the highest mutual information.
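
A minimal sketch of the first point, assuming plain whitespace tokens (the helper name `cbow_examples` and the window size are made up for illustration, not fastText's API): a CBOW-style model gets one training example per target word, predicting it from all of its context words at once, whereas skip-gram would instead emit one (word, context word) pair per context position.

```python
def cbow_examples(tokens, window=2):
    """Yield (context_words, target_word) pairs, CBOW-style:
    each word is predicted from the words around it."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target

# Skip-gram would instead flatten each example into
# (target, c) pairs, one per context word c.
for context, target in cbow_examples("the quick brown fox jumps".split()):
    print(context, "->", target)
```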
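And a sketch of the merging idea, using the word2vec phrase score (count(ab) − δ) / (count(a) · count(b)) as a cheap stand-in for mutual information; `merge_bigrams`, its threshold, and the discount δ are assumptions for illustration. It merges word bigrams into phrase tokens here, but the same loop run over characters would give BPE-like subword tokens.

```python
from collections import Counter

def merge_bigrams(tokens, threshold=0.05, delta=1, passes=2):
    """Repeatedly merge adjacent token pairs whose phrase score
    (count(ab) - delta) / (count(a) * count(b)) exceeds the threshold."""
    for _ in range(passes):
        if len(tokens) < 2:
            break
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        merged, skip = [], False
        for a, b in zip(tokens, tokens[1:]):
            if skip:            # b was consumed by the previous merge
                skip = False
                continue
            score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
            if score > threshold:
                merged.append(a + "_" + b)
                skip = True
            else:
                merged.append(a)
        if not skip:            # last token survives unless it was merged
            merged.append(tokens[-1])
        tokens = merged
    return tokens

print(merge_bigrams("new york is a city new york city".split()))
# ['new_york', 'is', 'a', 'city', 'new_york', 'city']
```

Real phrase extraction would also decay the threshold across passes and handle punctuation; this only shows the merge loop itself.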