wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

A strided CNN accepts raw audio input and outputs a latent encoding.
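
To make the downsampling concrete, here is a minimal sketch of how a stack of strided 1-D convolutions shortens a raw waveform into a sequence of latent frames. The layer list (kernel widths 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2) is an assumption taken from the configuration reported in the wav2vec 2.0 paper, not from this note, and the helper names are hypothetical. The sketch also assumes no padding and an input longer than the receptive field.

```python
# Assumed (kernel_width, stride) per convolutional layer, following the
# configuration reported in the wav2vec 2.0 paper. Not stated in this note.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples, layers=CONV_LAYERS):
    """Number of latent frames produced for a waveform of num_samples samples.

    Applies the unpadded 1-D convolution length formula layer by layer:
    out = floor((in - kernel) / stride) + 1.
    """
    length = num_samples
    for kernel, stride in layers:
        length = (length - kernel) // stride + 1
    return length


def receptive_field_and_stride(layers=CONV_LAYERS):
    """Samples seen by one latent frame, and the hop between adjacent frames."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the view by (k - 1) hops
        jump *= stride                # and multiplies the hop size by its stride
    return field, jump


frames_per_second = num_frames(16000)        # one second of 16 kHz audio
field, hop = receptive_field_and_stride()
print(frames_per_second, field, hop)
```

Under these assumptions each latent frame covers 400 samples (25 ms at 16 kHz) and consecutive frames are 320 samples (20 ms) apart.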

The latent encodings are passed to a Transformer, which is trained with a contrastive loss: for a given time step, the model must distinguish the correct latent from distractors sampled from other time steps of the same utterance.
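
The contrastive objective can be illustrated with a small pure-Python sketch. It scores a Transformer output (the context vector) against the correct latent and a set of distractors, then takes the negative log-probability of the correct one under a softmax. The use of cosine similarity and a temperature of 0.1 follows my reading of the wav2vec 2.0 paper and is an assumption here; the function names are hypothetical, and the sketch leaves out everything else in the training setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking `target` among target + distractors.

    `temperature` is an assumed value, not something this note specifies.
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidate logits.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# Toy example: the context vector points in the same direction as the target.
context = [1.0, 0.0]
aligned = contrastive_loss(context, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
# Same context, but now the "target" is one of the former distractors.
misaligned = contrastive_loss(context, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(aligned < misaligned)
```

The loss is near zero when the context vector is closest to the correct latent and grows as a distractor becomes the better match, which is what pushes the model to tell time steps within one utterance apart.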