or: Distributed Representations of Words and Phrases and their Compositionality

Expanding on the previous word2vec paper, the authors introduce negative sampling and phrase merging for creating word embeddings with a skip-gram model. The paper also advocates subsampling common tokens to avoid overfitting on frequent but uninformative words.
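Concretely, the paper discards each occurrence of a word $w_i$ with probability

$$\Large P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $f(w_i)$ is the word's frequency in the corpus and $t$ is a chosen threshold, around $10^{-5}$ in the paper.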

Negative sampling is a contrastive objective: the model learns to distinguish words drawn from a target word's true context from random words drawn from a noise distribution over the vocabulary.
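For each training pair, the objective rewards agreement with the observed context word and disagreement with $k$ sampled noise words:

$$\Large \log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$

where $v_{w_I}$ and $v'_{w_O}$ are the input and output vector representations, $\sigma$ is the sigmoid, and the noise distribution $P_n(w)$ is the unigram distribution raised to the $3/4$ power, which the paper found to work best. A minimal NumPy sketch of this loss for a single training pair (the function names and shapes are mine, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_in, v_out_pos, v_out_negs):
    """Negative-sampling loss for one (input, context) training pair.

    v_in:       input vector of the center word, shape (d,)
    v_out_pos:  output vector of the observed context word, shape (d,)
    v_out_negs: output vectors of k sampled noise words, shape (k, d)

    Returns the negated objective, i.e. a loss to minimize.
    """
    pos = np.log(sigmoid(v_out_pos @ v_in))
    neg = np.sum(np.log(sigmoid(-v_out_negs @ v_in)))
    return -(pos + neg)

def noise_distribution(counts):
    """P_n(w): unigram counts raised to the 3/4 power, normalized."""
    p = np.asarray(counts, dtype=float) ** 0.75
    return p / p.sum()
```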

Their phrase merging scores each bigram's frequency relative to the frequencies of its component unigrams:

$$\Large score(w_i, w_j) = \frac{count(w_i w_j) - \delta}{count(w_i)\times count(w_j)}$$

where $\delta$ is a discounting parameter that suppresses phrase formation from infrequent bigrams.
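A minimal sketch of one merging pass over a tokenized corpus (the `delta` and `threshold` defaults are illustrative; the released word2phrase tool additionally scales the score by the corpus size, and the paper runs 2-4 passes with decreasing thresholds so longer phrases can form):

```python
from collections import Counter

def merge_phrases(sentences, delta=5, threshold=100):
    """One pass of bigram merging over a list of token lists."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    total = sum(unigrams.values())

    def score(a, b):
        # (count(w_i w_j) - delta) / (count(w_i) * count(w_j)),
        # scaled by the corpus size as in the released word2phrase tool
        return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b]) * total

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and score(sent[i], sent[i + 1]) > threshold:
                out.append(sent[i] + "_" + sent[i + 1])  # merge into one token
                i += 2
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged
```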