WaveNet: A Generative Model for Raw Audio

WaveNet stacks dilated causal convolutions to predict the next audio sample as a distribution over μ-law-encoded (8-bit, 256-level) amplitude values, with a receptive field of roughly 300 ms.
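As a rough illustration of the two ingredients named above, the NumPy sketch below implements μ-law companding with 256 levels and a single-channel dilated causal convolution. The function names, the zero-padding choice, and the kernel size of 2 are assumptions made for this sketch, not details taken from the summary; a real model would use learned multi-channel filters.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compand floats in [-1, 1] and quantize to integers in [0, mu]."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.rint((companded + 1.0) / 2.0 * mu).astype(np.int64)

def mu_law_decode(q, mu=255):
    """Invert mu_law_encode, up to quantization error."""
    companded = 2.0 * np.asarray(q, dtype=np.float64) / mu - 1.0
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu

def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x as 0 before t = 0.

    The output at time t never depends on samples after t (causality).
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (len(weights) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])  # left-pad only
    y = np.zeros(len(x))
    for k, w in enumerate(weights):
        start = pad - k * dilation
        y += w * padded[start:start + len(x)]
    return y

def receptive_field(dilations, kernel_size=2):
    """Number of input samples seen by one output of a stack of dilated layers."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

With a kernel size of 2 and dilations 1, 2, 4, ..., 512, `receptive_field` gives 1024 samples, so several such stacks would be needed to span roughly 300 ms at typical speech sample rates.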
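Each prediction takes the previously generated samples as input, which makes the model autoregressive (and slow at inference, since samples must be generated one at a time). Generation can be conditioned on which speaker produces the audio. Another model, for text-to-speech, was conditioned on pronunciation (linguistic) features and the desired fundamental frequency. The network uses gated activation units: a tanh filter multiplied elementwise by a sigmoid gate.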
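The gated unit and the sample-by-sample generation loop can be sketched as follows. This is a minimal illustration under stated assumptions: `gated_unit` takes pre-activations that would come from the dilated convolutions, the conditioning terms stand in for a projected speaker or linguistic embedding, and `toy_predictor` is a hypothetical stand-in for the trained network, used only to make the loop runnable.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_unit(filter_pre, gate_pre, cond_filter=0.0, cond_gate=0.0):
    """tanh(filter + conditioning) * sigmoid(gate + conditioning), elementwise.

    cond_filter / cond_gate are assumed to be already-projected conditioning
    terms (e.g. derived from a speaker embedding); 0.0 means unconditioned.
    """
    return np.tanh(filter_pre + cond_filter) * sigmoid(gate_pre + cond_gate)

def generate(predict_logits, n_samples, n_classes=256, seed=0):
    """Autoregressive sampling: each new sample is drawn from a distribution
    computed from all previously generated samples, so the loop is sequential."""
    rng = np.random.default_rng(seed)
    history = []
    for _ in range(n_samples):
        logits = np.asarray(predict_logits(history), dtype=np.float64)
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        history.append(int(rng.choice(n_classes, p=probs)))
    return history

def toy_predictor(history, n_classes=256):
    """Hypothetical stand-in for the network: puts all mass on (last + 1) mod 256."""
    target = (history[-1] + 1) % n_classes if history else 0
    logits = np.full(n_classes, -np.inf)
    logits[target] = 0.0
    return logits

samples = generate(toy_predictor, n_samples=5)
```

Because every call to `predict_logits` needs the full history so far, the loop cannot be parallelized across time steps, which is the source of the slow inference noted above.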