Word2vec 2
or: Distributed Representations of Words and Phrases and their Compositionality
Expanding on the previous word2vec paper, the authors introduce negative sampling and phrase merging for creating word embeddings with a skip-gram model. The paper also advocates subsampling of frequent tokens so that training is not dominated by common but uninformative words.
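For reference, the paper's subsampling rule discards each occurrence of a word $w_i$ with probability

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $f(w_i)$ is the word's corpus frequency and $t$ is a chosen threshold (around $10^{-5}$ in the paper), so the most frequent words are dropped aggressively while rare words are kept.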
Negative sampling is a form of contrastive loss: the model is trained to distinguish words drawn from the true context from random "noise" words sampled from the vocabulary.
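Concretely, for an input word $w_I$ and an observed context word $w_O$, the paper's per-pair objective is

$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$

where $\sigma$ is the logistic sigmoid, $k$ is the number of negative samples, and $P_n(w)$ is the noise distribution; the paper finds the unigram distribution raised to the $3/4$ power works well in practice.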
Their phrase merging is based on scoring bigram frequency relative to its component unigrams' frequencies:

$$\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}$$

where $\delta$ is a discounting parameter to suppress phrase formation from infrequent bigrams.
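As a rough illustration, here is a minimal sketch of one scoring-and-merging pass over a tokenized corpus (the function name, the $\delta$ value, and the threshold are illustrative choices, not the paper's settings):

```python
from collections import Counter

def merge_phrases(tokens, delta=5.0, threshold=1e-4):
    """One pass of phrase merging: score adjacent token pairs and
    join those whose discounted bigram score clears the threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    merged = []
    i = 0
    while i < len(tokens) - 1:
        a, b = tokens[i], tokens[i + 1]
        # Bigram count, discounted by delta, relative to unigram counts.
        score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
        if score > threshold:
            merged.append(f"{a}_{b}")  # treat the pair as a single token
            i += 2
        else:
            merged.append(a)
            i += 1
    if i == len(tokens) - 1:
        merged.append(tokens[-1])  # trailing token was never merged
    return merged
```

In the paper this pass is run repeatedly (typically 2-4 times) with a decreasing threshold, so longer phrases can form out of already-merged bigrams.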