Wav2vec 2.0
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
A strided CNN takes the raw audio waveform as input and outputs a sequence of latent encodings, one per downsampled time step.
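To make the downsampling concrete, here is a minimal sketch of the arithmetic behind a strided CNN stack. The kernel widths and strides below are an assumption taken from the feature encoder configuration reported in the wav2vec 2.0 paper (seven layers, 16 kHz input), not something stated in these notes; the function names are hypothetical.

```python
# Assumed layer configuration (kernel width, stride) for the feature encoder,
# as reported in the wav2vec 2.0 paper. Treat these values as an assumption.
LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def output_length(num_samples, layers=LAYERS):
    """Number of latent frames produced from num_samples raw audio samples.

    Applies the unpadded 1-D convolution length formula layer by layer:
    out = (in - kernel) // stride + 1
    """
    length = num_samples
    for kernel, stride in layers:
        length = (length - kernel) // stride + 1
    return length


def total_stride(layers=LAYERS):
    """Hop, in raw samples, between two consecutive latent frames."""
    hop = 1
    for _, stride in layers:
        hop *= stride
    return hop


def receptive_field(layers=LAYERS):
    """Number of raw samples that influence a single latent frame."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


if __name__ == "__main__":
    print(output_length(16000))  # latent frames for one second at 16 kHz
    print(total_stride())        # samples between consecutive frames
    print(receptive_field())     # samples seen by each frame
```

Under these assumed settings, one second of 16 kHz audio yields 49 latent frames, each covering 400 samples (25 ms) with a hop of 320 samples (20 ms).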
The latent encodings are partially masked and passed to a transformer, which is trained with a contrastive loss: for each masked time step, the model must pick out the correct latent from a set of distractors sampled from other time steps of the same sequence.
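The contrastive loss can be sketched for a single time step as follows. This is an illustrative pure-Python version, not the authors' implementation: it assumes cosine similarity and a temperature of 0.1 (the values used in the paper), and the function names are hypothetical. In the paper the target is a quantized version of the latent encoding; here it is just a plain vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true target among all candidates.

    context:     transformer output at one time step
    target:      the correct latent for that time step
    distractors: latents taken from other time steps of the same sequence
    temperature: assumed value of 0.1, following the paper
    """
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, c) / temperature for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    # Index 0 holds the true target.
    return log_sum_exp - logits[0]


if __name__ == "__main__":
    context = [1.0, 0.0]
    target = [1.0, 0.0]
    distractors = [[0.0, 1.0], [-1.0, 0.0]]
    print(contrastive_loss(context, target, distractors))
```

The loss is close to zero when the context vector is much more similar to the true target than to any distractor, and grows as a distractor becomes the better match.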