CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

This is a modified BERT model in which a small-context transformer first encodes the character embeddings, and a strided convolution then compresses the character sequence to a manageable length before it enters the deep encoder.
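Below is a minimal PyTorch sketch of that downsampling path. The hidden size, sequence length, and head count are illustrative assumptions, and a standard full-attention layer stands in for CANINE's small-context (local-attention) block; only the 4x downsampling rate is taken from the paper.

```python
# Sketch only: sizes are assumptions, and a full-attention layer stands in
# for CANINE's local-attention character encoder.
import torch
import torch.nn as nn

d_model, downsample_rate = 768, 4  # CANINE downsamples characters 4x

# Small-context transformer over the raw character embeddings.
char_encoder = nn.TransformerEncoderLayer(d_model=d_model, nhead=12,
                                          batch_first=True)

# Strided convolution that compresses the character sequence.
downsample = nn.Conv1d(in_channels=d_model, out_channels=d_model,
                       kernel_size=downsample_rate, stride=downsample_rate)

char_embeddings = torch.randn(2, 2048, d_model)  # (batch, chars, dim)
h_char = char_encoder(char_embeddings)           # contextualized characters
h_down = downsample(h_char.transpose(1, 2)).transpose(1, 2)
print(h_down.shape)  # torch.Size([2, 512, 768]): input to the deep encoder
```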

To produce character-level outputs, the compressed latent representation is re-expanded to the original character length, combined with the initial character encodings, and passed to a final, full-context output transformer.
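A similarly hedged sketch of this upsampling step follows. Repeating the compressed states back to character length and concatenating them with the initial character states reflects the paper's description, but the 1x1 projection width and all layer sizes here are assumptions rather than the reference implementation.

```python
# Sketch only: layer sizes and the 1x1 projection are assumptions.
import torch
import torch.nn as nn

d_model, downsample_rate = 768, 4
h_char = torch.randn(2, 2048, d_model)  # initial character-level states
h_down = torch.randn(2, 512, d_model)   # deep-encoder output (compressed)

# Repeat each compressed position back to character length and concatenate
# with the initial character states so per-character detail is restored.
h_up = h_down.repeat_interleave(downsample_rate, dim=1)   # (2, 2048, 768)
h_cat = torch.cat([h_char, h_up], dim=-1)                 # (2, 2048, 1536)

# Project back to model width, then apply the final full-attention
# transformer layer that yields the character-level outputs.
project = nn.Conv1d(2 * d_model, d_model, kernel_size=1)
final_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=12,
                                         batch_first=True)
h_proj = project(h_cat.transpose(1, 2)).transpose(1, 2)
char_outputs = final_layer(h_proj)                        # (2, 2048, 768)
```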