or: Long Range Language Modeling via Gated State Spaces

Uses gating to shrink the required convolution length. Higher throughput than DSS, at the expense of added complexity.
Accuracy also improves when a few self-attention heads are added for better local context.
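
A rough sketch of how a gate can make a long convolution cheaper, written in NumPy. This is an illustration under stated assumptions, not the paper's layer: the sigmoid gate, the projection names (`w_gate`, `w_down`, `w_up`), the sizes, and the exponentially decaying stand-in kernel are all hypothetical. The idea shown is that the FFT convolution runs on a narrow projection of the input, and a gate computed from the full-width input modulates the result.

```python
import numpy as np


def causal_fft_conv(u, k):
    """Causal convolution of u (L, d) with a per-channel kernel k (L, d) via FFT."""
    L = u.shape[0]
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    U = np.fft.rfft(u, n=n, axis=0)
    K = np.fft.rfft(k, n=n, axis=0)
    return np.fft.irfft(U * K, n=n, axis=0)[:L]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_long_conv(x, w_gate, w_down, kernel, w_up):
    """Hypothetical gated block: the long convolution runs at width d_small."""
    gate = sigmoid(x @ w_gate)      # (L, d_model), depends only on the current step
    v = x @ w_down                  # (L, d_small), narrow input to the convolution
    y = causal_fft_conv(v, kernel)  # (L, d_small), the expensive long-range part
    return gate * (y @ w_up)        # (L, d_model), project back up and gate


# Illustrative sizes only; none of these come from the paper.
rng = np.random.default_rng(0)
L, d_model, d_small = 64, 16, 4
x = rng.standard_normal((L, d_model))
w_gate = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
w_down = rng.standard_normal((d_model, d_small)) / np.sqrt(d_model)
w_up = rng.standard_normal((d_small, d_model)) / np.sqrt(d_small)

# Stand-in kernel: one decaying exponential per channel (assumption, not a DSS kernel).
decay = np.linspace(0.5, 0.99, d_small)
kernel = decay[None, :] ** np.arange(L)[:, None]

out = gated_long_conv(x, w_gate, w_down, kernel, w_up)
print(out.shape)  # (64, 16)
```

In this sketch the FFTs touch only `d_small` channels instead of `d_model`, which is where the saving would come from; the extra projections and the gate are the added complexity.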