IGLOO: Slicing the Feature Space to Represent Sequences

A weird non-transformer experiment in long-range attention:
Store a shallow intermediate feature map of the conv-processed input. Multiple context heads each retrieve a concatenation of random slices of that feature map (a rough sketch follows the bullet below).
Who needs locality-sensitive hashing for context retrieval when random sampling is good enough in the long run?
How does this work as an attention mechanism?

  • Attend to particular discoveries: “I’ve found the golden ticket!” shouts some grammar-specific detector…
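
A minimal sketch of that slicing-and-gathering idea, assuming PyTorch with a single Conv1d front end; the `IglooSketch` module, its head/slice parameters, and the final linear mixing are my own stand-ins to illustrate the mechanism, not the paper's exact layer:

```python
import torch
import torch.nn as nn


class IglooSketch(nn.Module):
    def __init__(self, in_channels, conv_channels, num_heads,
                 slices_per_head, slice_size, seq_len, out_dim):
        super().__init__()
        # Shallow conv front end producing the intermediate feature map
        self.conv = nn.Conv1d(in_channels, conv_channels, kernel_size=3, padding=1)
        # Fixed random time indices per head: (num_heads, slices_per_head, slice_size)
        idx = torch.randint(0, seq_len, (num_heads, slices_per_head, slice_size))
        self.register_buffer("idx", idx)
        # Learned mixing of each head's concatenated slices into a context vector
        self.proj = nn.Linear(slices_per_head * slice_size * conv_channels, out_dim)
        self.num_heads = num_heads

    def forward(self, x):
        # x: (batch, seq_len, in_channels) -> feature map (batch, seq_len, conv_channels)
        feats = self.conv(x.transpose(1, 2)).transpose(1, 2)
        batch = feats.shape[0]
        # Gather every head's random slices from the feature map in one indexing op
        gathered = feats[:, self.idx.reshape(-1), :]             # (batch, H*S*K, C)
        gathered = gathered.reshape(batch, self.num_heads, -1)   # (batch, H, S*K*C)
        # Each head mixes its concatenated slices into a fixed-size context vector
        return self.proj(gathered)                               # (batch, H, out_dim)


# Toy usage on a long sequence
x = torch.randn(2, 1024, 8)
layer = IglooSketch(in_channels=8, conv_channels=16, num_heads=4,
                    slices_per_head=32, slice_size=4, seq_len=1024, out_dim=64)
print(layer(x).shape)  # torch.Size([2, 4, 64])
```

Each head ends up with a fixed-size context vector assembled from scattered positions across the whole sequence, which is where the long-range reach comes from without any pairwise attention matrix.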

It seems similar in spirit to BigBird for long-sequence learning: both lean on random sparse connections rather than full attention over the sequence.
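
The overlap is mainly the random part: BigBird combines random, sliding-window, and global attention over tokens, whereas IGLOO's random slices read from a conv feature map. A toy illustration of a random-sparse attention mask (my own example, not BigBird's actual block-sparse implementation):

```python
import torch


def random_sparse_mask(seq_len: int, random_per_query: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: each query attends to a few random keys.
    BigBird additionally adds local-window and global connections, omitted here."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for q in range(seq_len):
        keys = torch.randperm(seq_len)[:random_per_query]
        mask[q, keys] = True
    return mask


print(random_sparse_mask(16, 3).sum(dim=-1))  # each row attends to 3 random keys
```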