DiffWave is a fast, high-quality neural vocoder and waveform synthesizer. It starts with Gaussian noise and converts it into speech via iterative refinement. The speech can be controlled by providing a conditioning signal (e.g. log-scaled Mel spectrogram). The model and architecture details are described in DiffWave: A Versatile Diffusion Model for Audio Synthesis.
- stable training
- high-quality synthesis
- mixed-precision training
- multi-GPU training
- command-line inference
- programmatic inference API
- PyPI package
- audio samples
- pretrained models
Big thanks to Zhifeng Kong (lead author of DiffWave) for pointers and bug fixes.
...coming soon...
...coming soon...
Install using pip:
pip install diffwave
or from GitHub:
git clone https://github.com/lmnt-com/diffwave.git
cd diffwave
pip install .
Before you start training, you'll need to prepare a training dataset. The dataset can have any directory structure as long as the contained .wav files are 16-bit mono (e.g. LJSpeech, VCTK). By default, this implementation assumes a sample rate of 22.05 kHz. If you need to change this value, edit params.py.
python -m diffwave.preprocess /path/to/dir/containing/wavs
python -m diffwave /path/to/model/dir /path/to/dir/containing/wavs
# in another shell to monitor training progress:
tensorboard --logdir /path/to/model/dir --bind_all
You should expect to hear intelligible (but noisy) speech by ~8k steps (~1.5h on a 2080 Ti).
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count()
. You can specify which GPUs to use by setting the CUDA_DEVICES_AVAILABLE
environment variable before running the training module.
Basic usage:
from diffwave.inference import predict as diffwave_predict
model_dir = '/path/to/model/dir'
spectrogram = # get your hands on a spectrogram in [N,C,W] format
audio, sample_rate = diffwave_predict(spectrogram, model_dir)
# audio is a GPU tensor in [N,T] format.
python -m diffwave.inference /path/to/model /path/to/spectrogram -o output.wav