or: Early Convolutions Help Transformers See Better

The paper replaces ViT's patchify stem with a stack of stride-2 3x3 convolutions followed by a single 1x1 convolution, which brings a 224x224 image down to a 14x14 grid of tokens. This is then handed off to a transformer stack one block shorter than in the ViT paper, to approximately match the FLOPs required.
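As a concrete illustration, here is a minimal PyTorch sketch of such a convolutional stem. The channel widths, the BatchNorm/ReLU choices, and the embedding dimension are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Stacked stride-2 3x3 convs, then a 1x1 projection to the
    transformer embedding dim. Widths here are assumed, not the paper's."""
    def __init__(self, in_chans=3, widths=(64, 128, 256, 512), embed_dim=768):
        super().__init__()
        layers = []
        prev = in_chans
        for w in widths:  # each stride-2 conv halves spatial resolution
            layers += [
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True),
            ]
            prev = w
        # final 1x1 conv maps to the embedding dim; 224 / 2^4 = 14x14 tokens
        layers.append(nn.Conv2d(prev, embed_dim, kernel_size=1))
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                     # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, embed_dim) token sequence

tokens = ConvStem()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The token sequence then feeds the transformer blocks exactly as ViT's patch embeddings would, so the rest of the model is unchanged.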

It provides a slight accuracy benefit and a large stability benefit: training converges faster and is much less sensitive to learning rate and optimizer choice.
They note that their augmentation setup (AutoAugment, MixUp, CutMix, and label smoothing) worked better for them than the DeiT augmentation stack, and that cutting out repeated augmentation made convergence faster; a sketch of such a pipeline appears below.
They also note that under their training setup, ResNet is surprisingly competitive, matching ViT and beating an equivalent EfficientNet.
RegNets seem about the same as ResNets.
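For reference, here is a rough sketch of that kind of augmentation pipeline, built from torchvision's AutoAugment and timm's Mixup helper (which covers MixUp, CutMix, and label smoothing in one place). The specific alpha values and crop sizes are illustrative assumptions, not the paper's settings.

```python
import torch
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy
from timm.data import Mixup

# Per-image transforms; AutoAugment uses torchvision's ImageNet policy here
train_tfms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    AutoAugment(policy=AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
])

# Batch-level MixUp/CutMix plus label smoothing; alphas are assumed values
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0,
                 label_smoothing=0.1, num_classes=1000)

images = torch.randn(8, 3, 224, 224)
targets = torch.randint(0, 1000, (8,))
images, soft_targets = mixup_fn(images, targets)  # soft_targets: (8, 1000)
```

Note that MixUp/CutMix are applied at the batch level after collation, while AutoAugment runs per image inside the dataloader transforms, which is why the two live in separate places here.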