CLIP
or: Learning Transferable Visual Models From Natural Language Supervision
Contrastive learning is applied to image-caption pairs: the model learns to match each image with its own caption rather than to generate one.
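
To make the contrastive idea concrete, here is a minimal sketch of a symmetric contrastive loss over a batch of paired image and text embeddings. It is an illustration under assumptions, not the paper's implementation: the function names, the toy 2-D embeddings, and the fixed temperature of 0.07 are all chosen for this example, and the embeddings are assumed to be non-zero vectors that some encoder has already produced.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; temperature is fixed here (an assumption)."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between image i and caption j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Transpose to score each caption against every image
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy, hand-written embeddings (hypothetical data, not encoder outputs)
image_embs = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(image_embs, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(image_embs, [[0.0, 1.0], [1.0, 0.0]])
```

With correctly paired captions the loss is close to zero, and swapping the captions makes it large. That gap is the signal that pulls matching image and text embeddings together and pushes mismatched ones apart.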

from a laptop in Sunnyvale