MEGA: Moving Average Equipped Gated Attention

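Combines a GRU-style gated exponential moving average (EMA), the Gated Attention Unit (GAU), and ideas from S4/GSS.
Uses single-head gated attention; the Mega-chunk variant splits the sequence into fixed-size chunks to reach linear complexity. SOTA on the Long Range Arena (LRA) benchmark at publication, competitive on text tasks such as machine translation and language modeling.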
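
To make the moving-average component concrete, here is a minimal sketch of a damped exponential moving average over a scalar sequence. It is an illustration only, not the MEGA implementation: the function name `ema`, the parameters `alpha` and `delta`, and the zero initial state are assumptions chosen for this example, and the real model applies a learned, multi-dimensional version of this recurrence before the attention step.

```python
def ema(xs, alpha, delta=1.0):
    """Damped exponential moving average over a 1-D sequence (illustrative).

    Recurrence: y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}
    The state starts at 0.0 (an assumption for this sketch).
    With delta = 1.0 this reduces to the ordinary EMA.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")

    y = 0.0          # running state y_{t-1}
    out = []
    for x in xs:
        # Blend the new input with the decayed previous state.
        y = alpha * x + (1.0 - alpha * delta) * y
        out.append(y)
    return out


if __name__ == "__main__":
    # A constant input is approached gradually: recent positions dominate,
    # older ones decay geometrically.
    print(ema([1.0, 1.0, 1.0, 1.0], alpha=0.5))  # [0.5, 0.75, 0.875, 0.9375]
```

A larger `alpha` weights recent inputs more heavily, which is the local inductive bias the moving average contributes alongside attention.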