or: Distributed Representations of Words and Phrases and their Compositionality

Expanding on the previous word2vec paper, the authors introduce negative sampling and phrase merging for creating word embeddings with a skip-gram model. The paper also advocates subsampling common tokens to avoid overfitting on frequent but uninformative words.
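Concretely, the paper discards each occurrence of a word $w_i$ with probability

$$\Large P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $f(w_i)$ is the word's frequency in the corpus and $t$ is a chosen threshold, around $10^{-5}$ in the paper.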

Negative sampling is a contrastive objective: the model learns to distinguish words drawn from a target word's true context from random words drawn from a noise distribution over the vocabulary.
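For each training pair, the objective rewards agreement with the observed context word and disagreement with $k$ sampled noise words:

$$\Large \log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$

where $v_{w_I}$ and $v'_{w_O}$ are the input and output vector representations, $\sigma$ is the sigmoid, and the noise distribution $P_n(w)$ is the unigram distribution raised to the $3/4$ power, which the paper found to work best. A minimal NumPy sketch of this loss for a single training pair (the function names and shapes are mine, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_in, v_out_pos, v_out_negs):
    """Negative-sampling loss for one (input, context) training pair.

    v_in:       input vector of the center word, shape (d,)
    v_out_pos:  output vector of the observed context word, shape (d,)
    v_out_negs: output vectors of k sampled noise words, shape (k, d)

    Returns the negated objective, i.e. a loss to minimize.
    """
    pos = np.log(sigmoid(v_out_pos @ v_in))
    neg = np.sum(np.log(sigmoid(-v_out_negs @ v_in)))
    return -(pos + neg)

def noise_distribution(counts):
    """P_n(w): unigram counts raised to the 3/4 power, normalized."""
    p = np.asarray(counts, dtype=float) ** 0.75
    return p / p.sum()
```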

Their phrase merging scores each bigram's frequency relative to the frequencies of its component unigrams:

$$\Large score(w_i, w_j) = \frac{count(w_i w_j) - \delta}{count(w_i)\times count(w_j)}$$

where $\delta$ is a discounting parameter that suppresses phrase formation from infrequent bigrams.
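A minimal sketch of one merging pass over a tokenized corpus (the `delta` and `threshold` defaults are illustrative; the released word2phrase tool additionally scales the score by the corpus size, and the paper runs 2-4 passes with decreasing thresholds so longer phrases can form):

```python
from collections import Counter

def merge_phrases(sentences, delta=5, threshold=100):
    """One pass of bigram merging over a list of token lists."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    total = sum(unigrams.values())

    def score(a, b):
        # (count(w_i w_j) - delta) / (count(w_i) * count(w_j)),
        # scaled by the corpus size as in the released word2phrase tool
        return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b]) * total

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and score(sent[i], sent[i + 1]) > threshold:
                out.append(sent[i] + "_" + sent[i + 1])  # merge into one token
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged
```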