DeLighT: Deep And Light-Weight Transformer

A transformer that uses grouped linear projections with feature shuffling across groups to condense inputs to a lower dimensionality before the attention layer, and re-expands them afterwards. A deeper stack of these narrower attention blocks is more effective on the tasks measured. A minimal sketch of the reduce-attend-expand idea is given below.
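
Below is a minimal PyTorch sketch (not the paper's official code) of a grouped linear projection with feature shuffling across groups, used to shrink the model dimension before a single-head attention layer and expand it back afterwards. The names `GroupLinear` and `n_groups`, and the specific dimensions, are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GroupLinear(nn.Module):
    """Split features into groups, apply an independent linear map to each
    group, then shuffle features across groups so later layers can mix
    information between groups (assumed sketch, not the official layer)."""

    def __init__(self, d_in: int, d_out: int, n_groups: int, shuffle: bool = True):
        super().__init__()
        assert d_in % n_groups == 0 and d_out % n_groups == 0
        self.n_groups = n_groups
        self.shuffle = shuffle
        # One weight matrix per group: (groups, d_in/groups, d_out/groups)
        self.weight = nn.Parameter(
            torch.randn(n_groups, d_in // n_groups, d_out // n_groups) * 0.02
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in)
        b, t, _ = x.shape
        x = x.view(b, t, self.n_groups, -1)              # split features into groups
        x = torch.einsum("btgi,gio->btgo", x, self.weight)  # per-group linear map
        if self.shuffle:
            x = x.transpose(2, 3)                         # interleave features across groups
        return x.reshape(b, t, -1)                        # (batch, seq_len, d_out)


# Usage: condense a 512-d input to 256-d before attention, re-expand after.
x = torch.randn(2, 10, 512)
reduce = GroupLinear(512, 256, n_groups=4)
attn = nn.MultiheadAttention(256, num_heads=1, batch_first=True)
expand = GroupLinear(256, 512, n_groups=4)

h = reduce(x)              # (2, 10, 256): attention now runs on a smaller dimension
h, _ = attn(h, h, h)
y = expand(h)              # (2, 10, 512): back to the original model dimension
```

Because attention operates on the reduced dimension, each block is cheaper, which is what makes the deeper stack affordable.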