[This README has been adapted to my modification of the WaveGlow implementation.]
In our recent [paper], we propose WaveGlow: a flow-based network capable of
generating high quality speech from mel-spectrograms. WaveGlow combines insights
from [Glow] and [WaveNet] in order to provide fast, efficient and high-quality
audio synthesis, without the need for auto-regression. WaveGlow is implemented
using only a single network, trained using only a single cost function:
maximizing the likelihood of the training data, which makes the training
procedure simple and stable.
Our [PyTorch] implementation produces audio samples at a rate of 1200
kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio
quality as good as the best publicly available WaveNet implementation.
Visit our [website] for audio samples.
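
Concretely, the single cost function is the exact log-likelihood given by the change-of-variables formula for normalizing flows. Paraphrasing the paper's objective, with $z = f_\theta^{-1}(x)$ the latent for sample $x$, $\sigma$ the standard deviation of the spherical Gaussian prior, $s_j$ the affine-coupling scales, and $W_k$ the invertible 1×1 convolution weights:

$$
\log p_\theta(x) \;=\; -\frac{z^\top z}{2\sigma^2} \;+\; \sum_{j}\log s_j(x) \;+\; \sum_{k}\log\,\lvert\det W_k\rvert \;+\; \text{const}
$$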
- Clone our repo and initialize the submodule:

  ```
  git clone https://github.com/sungjae-cho/waveglow.git
  cd waveglow
  git submodule init
  git submodule update
  ```
- Install requirements:

  ```
  pip3 install -r requirements.txt
  ```
- Install [Apex].
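  Apex's install commands change over time, so the Python-only build below is only a sketch; check the [Apex] README for the current procedure:

  ```
  git clone https://github.com/NVIDIA/apex
  cd apex
  pip install -v --no-cache-dir ./
  ```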
- Download the published model. This is saved as `pretrained/waveglow_256channels_universal_v5.pt`.
- Download mel-spectrograms. These are saved in `mel_spectrograms`.
- Generate audio:

  ```
  python3 inference.py -f <(ls mel_spectrograms/*.pt) -w pretrained/waveglow_256channels_universal_v5.pt -o mel_spectrograms --is_fp16 -s 0.6
  ```

  Then, the command outputs the files below:

  ```
  mel_spectrograms/LJ001-0015.wav_synthesis.wav
  mel_spectrograms/LJ001-0051.wav_synthesis.wav
  mel_spectrograms/LJ001-0063.wav_synthesis.wav
  mel_spectrograms/LJ001-0072.wav_synthesis.wav
  mel_spectrograms/LJ001-0079.wav_synthesis.wav
  mel_spectrograms/LJ001-0094.wav_synthesis.wav
  mel_spectrograms/LJ001-0096.wav_synthesis.wav
  mel_spectrograms/LJ001-0102.wav_synthesis.wav
  mel_spectrograms/LJ001-0153.wav_synthesis.wav
  mel_spectrograms/LJ001-0173.wav_synthesis.wav
  ```
N.b. use `convert_model.py` to convert your older models to the current model with fused residual and skip connections.
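A sketch of the invocation, assuming the script takes the old and new checkpoint paths as positional arguments; check `convert_model.py` before running:

```
# Assumed positional arguments: <old_checkpoint> <new_checkpoint>
python3 convert_model.py checkpoints/old_waveglow.pt checkpoints/waveglow_converted.pt
```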
You can also run inference from Python. The snippet below loads the published model, synthesizes from a mel-spectrogram, denoises, peak-normalizes, and writes 16-bit PCM audio:

```python
import numpy as np
import torch
from scipy.io.wavfile import write
from denoiser import Denoiser

sampling_rate = 22050
audio_path = 'audio.wav'
waveglow_path = 'pretrained/waveglow_256channels_universal_v5.pt'

# Load the published model and strip weight norm for inference.
waveglow = torch.load(waveglow_path)['model']
for k, m in waveglow.named_modules():
    m._non_persistent_buffers_set = set()  # pytorch 1.6.0 compatibility
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow.cuda().eval()

wg_denoiser = Denoiser(waveglow).cuda()
wg_denoiser_strength = 0.1  # Removes model bias. Start with 0.1 and adjust. Max 1.0.
wg_sigma = 0.6  # default value in the official paper

# mel_outputs_postnet: a (1, n_mel_channels, frames) mel-spectrogram tensor on the GPU,
# e.g. produced by Tacotron 2 (see the sketch below for a stand-in).
with torch.no_grad():
    audio = wg_denoiser(waveglow.infer(mel_outputs_postnet, sigma=wg_sigma),
                        wg_denoiser_strength)

# Peak-normalize and convert to 16-bit PCM.
audio = audio.squeeze().cpu().numpy()
maxv = 2 ** (16 - 1)
audio /= max(abs(audio.max()), abs(audio.min()))
audio = (audio * maxv * 0.95).astype(np.int16)
write(audio_path, sampling_rate, audio)
```
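
If you don't have a Tacotron 2 front end producing `mel_outputs_postnet`, one of the saved mel-spectrogram files can stand in for it. A minimal sketch, assuming each `.pt` file stores a 2-D `(n_mel_channels, frames)` tensor as written by `mel2samp.py`:

```python
# The file name and tensor layout below are assumptions; check mel2samp.py in this fork.
mel = torch.load('mel_spectrograms/LJ001-0015.wav.pt')
mel_outputs_postnet = mel.cuda().unsqueeze(0)  # (1, n_mel_channels, frames)
```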
- Save a list of the file names for training into `train_files.txt` and for testing into `test_files.txt`.
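  One way to build these lists, assuming the LJ Speech layout suggested by the sample files above and a hypothetical `data/wavs` directory (adjust paths to your dataset):

  ```
  # Hold out the last 100 waves for testing; the split size is arbitrary.
  ls data/wavs/*.wav | head -n -100 > train_files.txt
  ls data/wavs/*.wav | tail -n 100 > test_files.txt
  ```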
- Train your WaveGlow networks:

  ```
  python train.py -c config.json --prj_name prj_name --run_name run_name --visible_gpus 1
  ```
  For multi-GPU training, replace `train.py` with `distributed.py`. This has only been tested with a single node and NCCL.

  ```
  python distributed.py -c config.json --prj_name prj_name --run_name run_name --visible_gpus 1,2,3
  ```
  For mixed precision training, set `"fp16_run": true` in `config.json`. To use this, `apex` must be installed.
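  A minimal excerpt of the relevant setting (the `train_config` nesting follows the upstream WaveGlow `config.json`; verify the exact layout in this fork):

  ```
  {
    "train_config": {
      "fp16_run": true
    }
  }
  ```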
- Make test set mel-spectrograms:

  ```
  mkdir test_out_dir
  python mel2samp.py -f test_files.txt -o test_out_dir -c config.json
  ```
- Do inference with your network at iteration 10000, stored in the `checkpoints/prj_name/run_name/waveglow_10000` checkpoint:

  ```
  ls test_out_dir/*.pt > file_lists/mel_files.txt
  python3 inference.py -f file_lists/mel_files.txt -w checkpoints/prj_name/run_name/waveglow_10000 -o test_out_dir --is_fp16 -s 0.6
  ```