or: Learning Transferable Visual Models From Natural Language Supervision

Contrastive learning is applied to image captioning.