gMLP, or: Pay Attention to MLPs

The authors introduce a Spatial Gating Unit (SGU): the input is split along the channel dimension, one half is transformed by a learned linear projection across the spatial (token) dimension whose weights are shared by all channels, and the result gates the other half elementwise.

Kind of like a depthwise convolution whose kernel spans the whole sequence, with a single set of weights shared across all channels.

An MLP stack shaped like a Transformer, but with the SGU doing input-conditional (multiplicative) spatial mixing in place of self-attention, is effective on both ImageNet classification and BERT-style pretraining.

A minimal PyTorch sketch of the SGU (the split/norm/project/gate ordering and the near-identity initialization follow the paper; the `(batch, seq_len, dim)` tensor layout is an assumption):
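```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Spatial Gating Unit from "Pay Attention to MLPs" (gMLP).

    Splits the input in two along the channel dimension, applies a learned
    spatial (cross-token) projection to one half -- the same weights shared
    by every channel -- and uses the result to gate the other half.
    Note the output has half the input's channel width.
    """

    def __init__(self, dim: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        # One linear map over the sequence dimension, shared by all channels.
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        # Paper initializes weights near zero (exactly zero here, for
        # simplicity) and the bias to ones, so the unit starts out as
        # an identity gate and the block behaves like a plain MLP.
        nn.init.zeros_(self.spatial_proj.weight)
        nn.init.ones_(self.spatial_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        u, v = x.chunk(2, dim=-1)                                 # split channels
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)  # mix tokens
        return u * v                                              # elementwise gate
```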
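And, under the same assumptions, a sketch of the full gMLP block that wraps the SGU between two channel projections with a residual connection (reuses `SpatialGatingUnit` from above; `d_ffn` is the paper's name for the expanded channel width):

```python
import torch.nn.functional as F

class gMLPBlock(nn.Module):
    """One gMLP block: norm, channel expansion, GELU, spatial gating,
    channel projection back down, plus a residual connection."""

    def __init__(self, dim: int, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.channel_in = nn.Linear(dim, d_ffn)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        # The SGU halves the channel width, hence d_ffn // 2 here.
        self.channel_out = nn.Linear(d_ffn // 2, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = F.gelu(self.channel_in(self.norm(x)))
        x = self.sgu(x)
        return shortcut + self.channel_out(x)

# Quick shape check: batch of 8, 196 tokens (e.g. 14x14 patches), 256 channels.
x = torch.randn(8, 196, 256)
y = gMLPBlock(dim=256, d_ffn=1536, seq_len=196)(x)
assert y.shape == x.shape
```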