Selfie: Self-supervised Pretraining for Image Embedding

A visual analogue to BERT pretraining. A truncated ResNet backbone encodes small image patches, which are fed into a transformer block. Some of the patches are masked out; for each masked position, the transformer must pick, from the real patch and several distractor patches sampled from other locations in the same image, the one that correctly completes the image. The network is trained with a contrastive loss. It converges quickly and produces a high-quality ResNet backbone that, when fine-tuned, outperforms the same network trained from scratch on classification.
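The core pretraining objective can be sketched as a softmax over candidate patches: a summary vector for the masked position is compared against the true patch and its distractors by dot product, and cross-entropy pushes probability toward the true patch. The sketch below is a minimal NumPy illustration of that loss, not the paper's implementation; the function name and shapes are assumptions for the example.

```python
import numpy as np

def contrastive_patch_loss(summary, candidates, target_idx):
    """Selfie-style contrastive loss (illustrative sketch).

    summary:    (d,) vector summarizing the visible patches for one
                masked position (produced by the transformer block).
    candidates: (k, d) embeddings of the true patch and k-1 distractor
                patches drawn from the same image.
    target_idx: index of the true patch among the candidates.
    """
    logits = candidates @ summary            # dot-product similarity
    logits -= logits.max()                   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_idx])        # cross-entropy

# Toy example: 4 candidate patches with 8-dim one-hot-style embeddings.
candidates = np.eye(4, 8)
target_idx = 2

# A summary aligned with the true patch gives a small loss (≈ 0.02).
aligned = 5.0 * candidates[target_idx]
print(contrastive_patch_loss(aligned, candidates, target_idx))

# An uninformative summary gives a uniform softmax, so loss = log(4).
print(contrastive_patch_loss(np.zeros(8), candidates, target_idx))
```

During pretraining this loss is averaged over all masked positions; fine-tuning then discards the transformer head and reuses only the patch-encoding ResNet backbone.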