WaveGrad

Implementation (PyTorch) of Google Brain's WaveGrad vocoder (paper: https://arxiv.org/pdf/2009.00713.pdf).

Status (STABLE VERSION)

  • Model training is stable, and multi-iteration inference is supported (6-iteration inference also works well).
  • Model training runs on a single 12GB GPU machine.
  • The model produces high-fidelity 22 kHz samples. Samples generated with different numbers of iterations have been uploaded.
  • The real-time factor (RTF) has been estimated for the model (see the table below). Inference with 100 or fewer iterations is faster than real time on an NVIDIA RTX 2080 Ti. The 6-iteration model is faster than the one reported in the paper.
  • Updated the code with new grid search utils for finding the best noise schedules.
  • Preparing pretrained checkpoints.

Real-time factor (RTF) and number of parameters

Model            Stable   RTF (NVIDIA RTX 2080 Ti), 22 kHz
1000 iterations  True     9.59 ± 0.357
100 iterations   True     0.94 ± 0.046
50 iterations    True     0.45 ± 0.021
25 iterations    True     0.22 ± 0.011
12 iterations    True     0.10 ± 0.005
6 iterations     True     0.04 ± 0.005

Number of parameters: 15,810,401
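
RTF here is assumed to follow the usual definition, i.e. the wall-clock time needed to synthesize an utterance divided by the duration of that utterance, so values below 1 mean faster-than-real-time generation:

\mathrm{RTF} = \frac{T_{\text{synthesis}}}{T_{\text{audio}}}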

About

WaveGrad is a conditional model for waveform generation that estimates gradients of the data density. The core idea behind the vocoder is its connection to diffusion probabilistic models built on Langevin dynamics and score matching. In terms of inference, WaveGrad achieves very fast convergence (as few as 6 iterations).
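
As a rough illustration, the DDPM-style refinement rule described in the paper (with \epsilon_\theta the trained network conditioned on the mel-spectrogram x, \alpha_n = 1 - \beta_n and \bar{\alpha}_n = \prod_{s \le n} \alpha_s) updates the current waveform estimate y_n at each of the N inference iterations as

y_{n-1} = \frac{1}{\sqrt{\alpha_n}} \left( y_n - \frac{1 - \alpha_n}{\sqrt{1 - \bar{\alpha}_n}} \, \epsilon_\theta\!\left(y_n, x, \sqrt{\bar{\alpha}_n}\right) \right) + \sigma_n z, \qquad z \sim \mathcal{N}(0, I),

starting from pure Gaussian noise y_N and ending at the generated waveform y_0 (no noise z is added at the final step).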

Setup

  1. Clone this repo:
git clone https://github.com/ivanvovk/WaveGrad.git
cd WaveGrad
  2. Install requirements: pip install -r requirements.txt

Train your own model

  1. Make filelists of your audio data like the ones included in the filelists folder.
  2. Set up a configuration in the configs folder.
  3. Change the config path in train.sh and run the script with sh train.sh.
  4. To track the training process, run TensorBoard with tensorboard --logdir=logs/YOUR_LOG_FOLDER.
  5. Once the model is trained, grid-search the best noise schedule for the desired number of iterations in notebooks/inference.ipynb.

Inference, generated audios and pretrained checkpoints

Inference, RTF in your runtime environment and best-schedule grid search

To run inference with your model, follow the instructions provided in the Jupyter notebook notebooks/inference.ipynb. It also contains the code to estimate the RTF in your runtime environment and to run the best-schedule grid search.
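
For intuition, here is a minimal, self-contained sketch of such an iterative-refinement loop. It is not the repo's actual API (use notebooks/inference.ipynb for that); the eps_theta stub and the particular betas below are placeholders for the trained network and a grid-searched schedule.

import numpy as np

def eps_theta(y_noisy, mel, noise_level):
    # Placeholder for the trained WaveGrad network conditioned on a mel-spectrogram;
    # it only exists so the loop below runs end to end.
    return np.zeros_like(y_noisy)

def sample(mel, betas, length, seed=0):
    # DDPM-style iterative refinement: start from Gaussian noise and denoise
    # over len(betas) iterations according to the given noise schedule.
    rng = np.random.default_rng(seed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    y = rng.standard_normal(length)
    for n in reversed(range(len(betas))):
        eps = eps_theta(y, mel, np.sqrt(alpha_bars[n]))
        y = (y - (1.0 - alphas[n]) / np.sqrt(1.0 - alpha_bars[n]) * eps) / np.sqrt(alphas[n])
        if n > 0:  # no noise is added at the final step
            sigma = np.sqrt((1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n])
            y += sigma * rng.standard_normal(length)
    return y

# Illustrative 6-iteration schedule; in practice the betas should come from the grid search.
waveform = sample(mel=None, betas=np.linspace(1e-6, 0.01, 6), length=7200)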

Generated audios

Generated audio samples are provided in the generated_samples folder. The quality degradation between 1000-iteration and 6-iteration inference is not noticeable if the best schedule is found for the latter.

Pretrained checkpoints

In progress.

Important details, issues and comments

  • During training, WaveGrad uses a default noise schedule of 1000 iterations with linearly spaced betas in the range (1e-6, 0.01). For inference you can set another schedule with fewer iterations (the 6 iterations reported in the paper also work for this implementation!). Tune the betas carefully: the output quality strongly depends on them.
  • The best practice is to run the grid search function iters_grid_search(...) from benchmark.py to find the best schedule for your number of iterations.
  • Model training successfully runs on a single 12GB GPU machine. The batch size is reduced compared to the paper (256 -> 48, as the authors trained their model on a TPU). After ~10k iterations (1-2 hours) the model already generates well for 50-iteration inference. Total training time is about 1-2 days (until full convergence).
  • The model converges to acceptable quality within 10-20 thousand training iterations (~2 hours) for 50-iteration inference.
  • At some point training may become unstable (the loss explodes), so learning rate (LR) scheduling and gradient clipping have been introduced (a minimal sketch of this pattern is shown after this list).
  • The hop length of your STFT should always equal 300 (which is the total upsampling factor). Other values are not supported yet.
  • It is crucial to keep LINEAR_SCALE=5000 (already set by default) in model/linear_modulation.py. It rescales positional embeddings to have absolute amplitude 1/LINEAR_SCALE. This is important since the continuous noise level and the sinusoidal positional encodings both lie in the range (-1, 1). Without rescaling, the model cannot properly extract noise information from the positionally embedded continuous noise level and thus cannot extrapolate to mel-spectrogram sequences longer than the training segment (7200 timepoints by default).
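
The LR scheduling and gradient clipping mentioned above follow the standard PyTorch pattern; the snippet below is a minimal sketch of that pattern with a dummy model and loss, not the repo's actual training loop.

import torch

# Dummy model, optimizer and scheduler, just to show the pattern.
model = torch.nn.Linear(80, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

for step in range(100):
    mels = torch.randn(16, 80)
    loss = model(mels).pow(2).mean()  # dummy loss
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping keeps exploding gradients from destabilizing training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    # LR scheduling gradually lowers the learning rate.
    scheduler.step()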

History of updates

  • (NEW) Major update. New well-generated 6-iteration sample example. New noise-schedule-setting API. Added the best-schedule grid search code.
  • Improved training by introducing a smarter learning rate scheduler. Obtained high-fidelity synthesis.
  • Stable training and multi-iteration inference. 6-iteration noise scheduling is supported.
  • Stable training and fixed-iteration inference, with significant background static noise remaining. All positional encoding issues solved.
  • Stable training of 25-, 50- and 1000-fixed-iteration models. Found that the linear scaling (C=5000 from the paper) of the positional encoding was missing (bug).
  • Stable training of 25-, 50- and 1000-fixed-iteration models. Fixed positional encoding downscaling. Parallel segment sampling replaced by full-mel sampling.
  • (RELEASE, first on GitHub). Parallel segment sampling and broken positional encoding downscaling. Poor quality with clicks caused by concatenating parallel-generated segments.

References