NNCP
or: Lossless Data Compression with Neural Networks
Fabrice Bellard’s neural text compressor:
- Preprocess words and stems into vocab tokens.
- Train an LSTM-RNN or autoregressive transformer to predict the next word token.
- Arithmetic-code each token using the model's predicted probability distribution over the vocabulary.
Near-SOTA for enwik8/enwik9 text compression.
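The coding step above can be sketched with a minimal arithmetic coder. In NNCP the per-token distribution comes from the network's softmax output at each position; here a fixed distribution stands in for it, and floating-point interval narrowing stands in for the integer-range coder a real implementation would use (both are simplifying assumptions, not NNCP's actual code):

```python
def cum_ranges(probs):
    """Map each symbol to its cumulative [lo, hi) slice of [0, 1)."""
    ranges, c = {}, 0.0
    for s, p in probs.items():
        ranges[s] = (c, c + p)
        c += p
    return ranges

def encode(symbols, probs):
    """Narrow [low, high) by each symbol's slice; emit a number inside.
    Float precision limits this toy version to short messages."""
    low, high = 0.0, 1.0
    r = cum_ranges(probs)
    for s in symbols:
        lo, hi = r[s]
        span = high - low
        high = low + span * hi   # uses the old low on purpose
        low = low + span * lo
    return (low + high) / 2      # any number in [low, high) decodes correctly

def decode(code, n, probs):
    """Invert encode(): repeatedly find which slice the code falls in."""
    low, high = 0.0, 1.0
    r = cum_ranges(probs)
    out = []
    for _ in range(n):
        span = high - low
        x = (code - low) / span
        for s, (lo, hi) in r.items():
            if lo <= x < hi:
                out.append(s)
                high = low + span * hi
                low = low + span * lo
                break
    return out
```

A symbol with probability p narrows the interval by a factor of p, i.e. costs about -log2(p) bits, which is why better next-token predictions translate directly into smaller output.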
Unusually, NNCP learns its weights from scratch during compression, in a single pass: the decompressor trains an identical model on the tokens it has already decoded, so both sides' weights stay synchronized and never need to be stored in the compressed file.
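The single-pass trick works because compression and decompression are symmetric: both sides start from the same initial model and apply the same deterministic update after every symbol, so their predicted distributions never diverge. A toy count-based model (a hypothetical stand-in for NNCP's network, not its code) shows the lockstep:

```python
class CountModel:
    """Adaptive symbol-frequency model, standing in for the neural net."""
    def __init__(self, alphabet):
        # Laplace smoothing: every symbol starts with count 1.
        self.counts = {s: 1 for s in sorted(alphabet)}

    def predict(self):
        """Current probability distribution over the alphabet."""
        total = sum(self.counts.values())
        return {s: c / total for s, c in self.counts.items()}

    def update(self, sym):
        """Deterministic training step applied by BOTH sides."""
        self.counts[sym] += 1

msg = "abracadabra"
encoder_model = CountModel(set(msg))
decoder_model = CountModel(set(msg))  # same initial state; no weights sent
for sym in msg:
    # Both sides condition only on already-processed symbols,
    # so their predictions agree at every step.
    assert encoder_model.predict() == decoder_model.predict()
    encoder_model.update(sym)  # encoder trains on the true symbol
    decoder_model.update(sym)  # decoder trains on the symbol it just decoded
```

With a neural model the `update` step is a gradient descent step, which is why decompression is as slow as compression: the decoder must redo all the training.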