or: Bottleneck Transformers for Visual Recognition

Replacing the 3x3 bottleneck convolutions in a resNet with transformer layers creates an effective backbone for image tasks.

compared to an unmodified resNet, it has fewer params but consumes more Flops.
When heavily pretrained, accuracy is slightly better than SENets and other resNet variants.