Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

A transformer variant that shrinks the sequence length between blocks of layers using strided mean pooling, trading fine-grained token resolution for reduced compute. The compressed representation may be useful for summarizing long sequences.
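The core idea (length reduction via strided mean pooling between blocks) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; `pool_sequence` is a hypothetical helper name, and real Funnel-Transformer pools only the query side of attention at the boundary block.

```python
import numpy as np

def pool_sequence(x, stride=2):
    """Mean-pool hidden states along the sequence axis with the given stride,
    roughly halving the sequence length for stride=2 (sketch only)."""
    seq_len, d_model = x.shape
    # Pad the tail so seq_len is divisible by stride (zero padding skews the
    # last window's mean slightly; acceptable for a sketch).
    pad = (-seq_len) % stride
    if pad:
        x = np.concatenate([x, np.zeros((pad, d_model))], axis=0)
    # Group consecutive tokens into windows of size `stride` and average them.
    return x.reshape(-1, stride, d_model).mean(axis=1)

hidden = np.random.rand(8, 16)   # (seq_len=8, d_model=16)
pooled = hidden
for _ in range(2):               # two pooling boundaries, as between blocks
    pooled = pool_sequence(pooled)
print(pooled.shape)              # sequence length shrinks 8 -> 4 -> 2
```

Each pooling step quarters the cost of the subsequent self-attention (quadratic in sequence length), which is where the efficiency gain comes from.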

The paper reports modest gains over a standard Transformer at comparable compute (FLOPs) on benchmarks such as GLUE.