AudioLM: a Language Modeling Approach to Audio Generation

The authors use a pretrained neural audio codec that converts between audio waveforms and high-fidelity acoustic tokens, and train a language model on these tokens together with higher-level semantic tokens.
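
To make the token pipeline concrete, here is a minimal sketch of the idea, not the authors' implementation. It assumes toy stand-ins for both tokenizers: a uniform quantizer plays the role of the neural codec, and a frame-energy quantizer plays the role of the semantic tokenizer. All names, vocabulary sizes, and the frame length are hypothetical, and placing the coarse tokens before the fine ones is an illustrative choice.

```python
import math

# Toy stand-ins, NOT the paper's models: a real codec and a real semantic
# tokenizer are learned neural networks. All constants are hypothetical.
ACOUSTIC_VOCAB = 256   # assumed codebook size for fine-grained audio tokens
SEMANTIC_VOCAB = 16    # assumed codebook size for coarse tokens
FRAME = 4              # assumed number of samples summarized per coarse token


def encode_acoustic(waveform):
    """Map samples in [-1, 1] to integer tokens in [0, ACOUSTIC_VOCAB)."""
    tokens = []
    for x in waveform:
        x = max(-1.0, min(1.0, x))  # clamp out-of-range samples
        index = int((x + 1.0) / 2.0 * ACOUSTIC_VOCAB)
        tokens.append(min(ACOUSTIC_VOCAB - 1, index))
    return tokens


def decode_acoustic(tokens):
    """Approximate inverse: map each token to the center of its bin."""
    return [(t + 0.5) / ACOUSTIC_VOCAB * 2.0 - 1.0 for t in tokens]


def encode_semantic(waveform):
    """Coarse tokens: one per FRAME samples, from mean absolute amplitude."""
    tokens = []
    for i in range(0, len(waveform), FRAME):
        frame = waveform[i:i + FRAME]
        energy = sum(abs(x) for x in frame) / len(frame)
        tokens.append(min(SEMANTIC_VOCAB - 1, int(energy * SEMANTIC_VOCAB)))
    return tokens


def build_lm_sequence(waveform):
    """Coarse tokens first, then fine tokens shifted into their own id range."""
    semantic = encode_semantic(waveform)
    acoustic = [SEMANTIC_VOCAB + t for t in encode_acoustic(waveform)]
    return semantic + acoustic


# A short 440 Hz sine at an assumed 16 kHz sample rate.
waveform = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16)]
sequence = build_lm_sequence(waveform)
reconstructed = decode_acoustic(encode_acoustic(waveform))
```

The resulting `sequence` is a flat list of integer ids that an ordinary autoregressive language model could be trained on; generated fine-grained ids would be shifted back by `SEMANTIC_VOCAB` and passed to `decode_acoustic` to recover a waveform.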