Simple Recurrent Units for Highly Parallelizable Recurrence

The Simple Recurrent Unit (SRU) is a recurrent cell optimized for speed. Its authors claim a speedup of roughly 10x over an LSTM on a GPU or other accelerator.
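To make the source of the speedup concrete, here is a minimal pure-Python sketch of one SRU layer. The note above gives no equations, so this follows the commonly cited SRU formulation (forget gate f, reset gate r, internal state c, highway connection to the input) and should be read as an illustration under that assumption, not as the authors' implementation. The function and parameter names (sru_layer, W, Wf, Wr, vf, vr, bf, br) are hypothetical, and the input and hidden sizes are assumed equal so the highway term needs no extra projection.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sru_layer(xs, W, Wf, Wr, vf, vr, bf, br):
    """Run one SRU layer over a sequence xs (a list of d-dimensional vectors).

    Assumption: input size == hidden size == d.
    Returns (list of outputs h_t, final state c).
    """
    d = len(xs[0])

    # Step 1: the matrix products depend only on the inputs x_t, never on
    # the previous state, so all time steps can be computed at once
    # (a single batched matmul on a GPU).
    U = [matvec(W, x) for x in xs]
    Uf = [matvec(Wf, x) for x in xs]
    Ur = [matvec(Wr, x) for x in xs]

    # Step 2: the sequential part is elementwise only. Each dimension i
    # of c depends solely on dimension i of the previous c.
    c = [0.0] * d
    hs = []
    for x, u, uf, ur in zip(xs, U, Uf, Ur):
        f = [sigmoid(uf[i] + vf[i] * c[i] + bf[i]) for i in range(d)]
        r = [sigmoid(ur[i] + vr[i] * c[i] + br[i]) for i in range(d)]
        c = [f[i] * c[i] + (1.0 - f[i]) * u[i] for i in range(d)]
        h = [r[i] * c[i] + (1.0 - r[i]) * x[i] for i in range(d)]
        hs.append(h)
    return hs, c
```

The contrast with an LSTM is that an LSTM multiplies the previous hidden state by a full weight matrix at every step, so its matrix products cannot be batched across time. In this sketch the only work left inside the time loop is elementwise, which is cheap and parallel across hidden dimensions.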

see comparison