CoAtNet: Marrying Convolution and Attention for All Data Sizes

A family of networks with different parameter/accuracy tradeoffs, built from a staged architecture:

Strided Conv ->
Conv ->
56x56xD1 MBConv (L1 layers) ->
28x28xD2 MBConv (L2 layers) ->
14x14xD3 Transformer (L3 layers) ->
7x7xD4 Transformer (L4 layers)
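The stage layout above can be sketched in plain Python. This is a minimal, illustrative sketch, not the paper's implementation: the channel widths and layer counts below are assumptions loosely modeled on the smallest variant, and the helper `trace_resolutions` is hypothetical.

```python
# Illustrative sketch of the CoAtNet stage layout.
# Channel widths (D1..D4) and layer counts (L1..L4) are assumed values
# for illustration; the paper varies them per model size.
STAGES = [
    # (name, block type, stride, channels, layers)
    ("S0", "conv",        2, 64,  2),   # stem: strided conv + conv, 224 -> 112
    ("S1", "mbconv",      2, 96,  2),   # 56x56xD1, L1 layers
    ("S2", "mbconv",      2, 192, 3),   # 28x28xD2, L2 layers
    ("S3", "transformer", 2, 384, 5),   # 14x14xD3, L3 layers
    ("S4", "transformer", 2, 768, 2),   # 7x7xD4,  L4 layers
]

def trace_resolutions(input_size=224):
    """Return the feature-map side length after each stage."""
    sizes = []
    size = input_size
    for _name, _kind, stride, _channels, _layers in STAGES:
        size //= stride  # each stage downsamples by its stride
        sizes.append(size)
    return sizes

print(trace_resolutions())  # -> [112, 56, 28, 14, 7]
```

Note how convolutional blocks handle the large early feature maps, while Transformer blocks only run at 14x14 and 7x7, where quadratic attention cost is affordable.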

At ~75M params it performs similarly to BoTNet. Scaled to an eye-watering 2.4B params, it achieves SOTA on ImageNet after large-scale pretraining. (Eye-watering for ImageNet, at least...)