kan-bayashi/ParallelWaveGAN

Progress report

kan-bayashi opened this issue · 50 comments

This issue records the progress of training.

I have finished almost all of the implementation.
I will start the training based on the recipe.
https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/ljspeech/voc1/run.sh

Samples at 75k steps are available.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi

The training curve is as follows:
[Screenshot: training curve, 2019-11-02 11:32 AM]

Interestingly, the GAN loss is not applied yet, but the quality is not bad.
The WaveNet architecture and the STFT-based loss are very strong.

@kan-bayashi Looks promising. I have one vacant V100; I'm definitely going to start training this weekend.

Hi @rishikksh20.
That is nice.
If you find any important points, please share them with me.

Added samples at 100k and 130k steps.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi
It seems the quality is improving gradually.

@kan-bayashi Yes, the sound quality has improved a lot; without headphones it's hard to judge. I have also started a fresh training run to determine several parameters for a TTS baseline.
What is the inference speed? Is it real time?

Added samples at 170k steps.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi

@rishikksh20 The speed is amazing. I think this is fast enough.

2019-11-03 09:07:03,286 (decode:89) INFO: the number of features to be decoded = 250.
2019-11-03 09:07:10,390 (decode:103) INFO: loaded model parameters from exp/train_ljspeech_parallel_wavegan.v1/checkpoint-175000steps.pkl.
[decode]: 100%|██████████| 250/250 [00:30<00:00,  8.31it/s, RTF=0.0156]
2019-11-03 09:07:40,480 (decode:127) INFO: finished generation of 250 utterances (RTF = 0.016).
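
For reference, the RTF above is just the wall-clock generation time divided by the duration of the generated audio. A minimal sketch of that calculation (not the decode script's exact code; 22050 Hz is assumed as the LJSpeech sampling rate):

# Minimal sketch of a real-time-factor (RTF) calculation (not the decode script's exact code).
import time

def real_time_factor(generate, inputs, sampling_rate=22050):
    """RTF = wall-clock generation time / duration of the generated audio."""
    start = time.time()
    audio = generate(inputs)  # assumed to return a 1-D waveform array
    elapsed = time.time() - start
    return elapsed / (len(audio) / sampling_rate)

# RTF < 1.0 means faster than real time; 0.016 is roughly 60x faster than real time.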

Added samples at 200k steps.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi
I think it has reached a nice quality.

I attached the training curves.
The pulse at 120k is caused by resuming; please ignore it.

[Screenshots of the training curves:]
- Spectral convergence loss
- log STFT magnitude loss
- Adversarial loss
- Generator loss (sc + mag + adv)
- Fake loss
- Real loss
- Discriminator loss (fake + real)

I made Mandarin and Japanese recipes.
I will start training them.

Added samples at 280k steps.
They have now reached almost the same quality as the official samples.
https://r9y9.github.io/demos/projects/icassp2020/

Added jsut and csmsc intermediate samples.

I added an arctic recipe (#24).
If it is possible to train with only 1000 utterances, that is great news for voice conversion.
I will try to train it.

@kan-bayashi The original model is 1.4MB, why is the model size 16.7MB now?

@xzm2004260
You mean the number of parameters? (It is not 1.4MB but 1.4M.)
This is because the checkpoint includes all of the states of the models, optimizers, and schedulers.
You can extract only the generator parameters as follows:

import torch

# Load the full training checkpoint and keep only the generator's weights.
states = torch.load(checkpoint_path, map_location="cpu")["model"]["generator"]
torch.save(states, "test.pkl")
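
If you want to load the extracted weights back for inference, a rough sketch could look like the following (double-check the class name, default arguments, and remove_weight_norm() call against your installed version and config):

# Rough sketch: load the extracted generator weights for inference.
# Check the import path and default hyperparameters against your own version/config.
import torch
from parallel_wavegan.models import ParallelWaveGANGenerator

generator = ParallelWaveGANGenerator()  # assumes the default v1 architecture
generator.load_state_dict(torch.load("test.pkl", map_location="cpu"))
generator.remove_weight_norm()  # weight normalization is only needed during training
generator.eval()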

I am comparing two models: one follows the paper's suggestion to start training the discriminator from 100k iterations, and the other trains it from the first step.
[Screenshot: training-curve comparison, 2019-11-04 7:47 PM]
From the training curves, the discriminator can be trained from the first step.
In terms of log STFT magnitude loss, the model trained from the first step is better, while the spectral convergence loss is almost the same.

@kan-bayashi How can I train the model on a specific GPU? And will multi-GPU training be supported?

@kobenaxie You can specify the GPU via CUDA_VISIBLE_DEVICES, e.g.

CUDA_VISIBLE_DEVICES=1 ./run.sh --stage 2

Multi-GPU training could be supported, but currently I do not feel it is necessary because training fits on a single 12 GB GPU and finishes within ~3 days.

@kan-bayashi Thank you very much for your reply~

I added how to specify the GPU to the README.
https://github.com/kan-bayashi/ParallelWaveGAN#run

I finished the initial training and made a demo page to compare the quality with the official samples.
https://kan-bayashi.github.io/ParallelWaveGAN/

I added intermediate results of jsut, csmsc, and arctic.

I ran this repo with LJSpeech on a 2080 Ti GPU. When the code reaches 1000 steps, it raises an error.
[screenshot of the error]
Could you give me some advice on how to fix this?

@MorganCZY
The easiest way is to use a slightly smaller batch_max_steps.
(Make sure it is a multiple of hop_size.)
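
For example, you can round a smaller value down to a valid multiple like this (the numbers below assume the LJSpeech parallel_wavegan.v1 config, i.e. batch_max_steps=25600 and hop_size=256; check them against your own config):

# Round a desired batch_max_steps down to a multiple of hop_size.
# The values here are assumptions from the LJSpeech v1 config; adjust to yours.
hop_size = 256
desired = 20000  # anything smaller than the default 25600
batch_max_steps = (desired // hop_size) * hop_size  # round down to a valid multiple
assert batch_max_steps % hop_size == 0
print(batch_max_steps)  # -> 19968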

Could you give me the exact GPU memory size of the 2080 Ti?
In my case, training requires 11613 MiB of memory, so I think 12 GB is enough.

I found a bug in the evaluation.
I forgot to add torch.no_grad() in the evaluation loop, which caused additional memory consumption.
I will fix it.
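
The gist of the fix is just wrapping the evaluation step in torch.no_grad(); a minimal sketch (not the repo's exact training code, with a simplified batch layout):

# Minimal sketch of the fix (not the repo's exact code): run evaluation under
# torch.no_grad() so no autograd graph is kept and GPU memory stays low.
import torch

def eval_step(model, batch, device="cuda"):
    model.eval()
    with torch.no_grad():
        z, c = batch  # noise and conditioning features (simplified batch layout)
        y_hat = model(z.to(device), c.to(device))
    model.train()
    return y_hat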

[screenshot of nvidia-smi output]
I used GPU 0, shown in the picture above. During training the memory is sufficient, but after 1000 steps it enters the evaluation process and the error occurs.

That explains my errors.

After the fix (#26), the GPU memory usage decreased from 11613 to 11151 MiB.

@kan-bayashi The error is fixed with #26! Thx~

I summarized the comparison from #1 (comment) in #27, and found an issue with the discriminator.

I uploaded the initial model of csmsc and jsut.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi

I updated the demo page.

I finished arctic training.
Surprisingly, Parallel WaveGAN can be trained with only 988 utterances.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi

Thanks to @erogol, single-node multi-process distributed multi-GPU training is now supported (#30, #31).
If you want to try it, please follow the instructions in the README.
https://github.com/kan-bayashi/ParallelWaveGAN#run

Now I think the single-speaker model works well for various datasets. I will consider implementing a multi-speaker model.

I made a real-time E2E-TTS demonstration notebook.
You can try it online with Google Colab.
English and Japanese models are available!
https://colab.research.google.com/gist/kan-bayashi/bc65baf72faaebe5efd601b013e07342/e2e-tts-demo.ipynb
I will add a Mandarin example tomorrow.

@kan-bayashi Is it possible to train on GTA (ground-truth-aligned) features from espnet's tacotron or fastspeech?

@rishikksh20 It can be. Just replacing the normalized feats with the generated ones is OK.
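
Something like this hypothetical sketch, for example (the dump directory layout and the <utt_id>-feats.npy naming are assumptions, so adapt them to however your features are dumped):

# Hypothetical sketch: overwrite each utterance's normalized mel spectrogram with the
# acoustic model's teacher-forced (GTA) prediction, so the vocoder trains on the same
# features it will see at synthesis time. Paths and file names are assumptions.
import numpy as np

def replace_with_gta(gta_mels, dump_dir="dump/train_nodev/norm"):
    for utt_id, mel in gta_mels.items():  # mel: (num_frames, num_mels) float array
        np.save(f"{dump_dir}/{utt_id}-feats.npy", mel.astype(np.float32))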

I updated the notebook so you can select tacotron2, transformer, or fastspeech!
https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb

I added a Colab link. You can enjoy real-time synthesis!

I added the Mandarin TTS example in the notebook.
It seems to be working, but the text frontend part may have some bugs because I cannot understand Mandarin...

This implementation has become stable.
I will close this issue.

@kan-bayashi I cloned the latest git master version, ran ljspeech with parallel_wavegan.v1.yaml on a single 2080 Ti (12G mem), and got the following error, very similar to @MorganCZY's.

[train]: 25%|██▌ | 100001/400000 [14:26:31<239:27:09, 2.87s/it]
Traceback (most recent call last):
  File "/opt/speech/tools/ParalleWaveGAN/tools/venv/bin/parallel-wavegan-train", line 11, in <module>
    load_entry_point('parallel-wavegan', 'console_scripts', 'parallel-wavegan-train')()
  File "/opt/speech/tools/ParalleWaveGAN/parallel_wavegan/bin/train.py", line 760, in main
    trainer.run()
  File "/opt/speech/tools/ParalleWaveGAN/parallel_wavegan/bin/train.py", line 87, in run
    self._train_epoch()
  File "/opt/speech/tools/ParalleWaveGAN/parallel_wavegan/bin/train.py", line 260, in _train_epoch
    self._train_step(batch)
  File "/opt/speech/tools/ParalleWaveGAN/parallel_wavegan/bin/train.py", line 202, in _train_step
    gen_loss.backward()
  File "/opt/speech/tools/ParalleWaveGAN/tools/venv/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/speech/tools/ParalleWaveGAN/tools/venv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 38.00 MiB (GPU 0; 10.76 GiB total capacity; 9.51 GiB already allocated; 28.69 MiB free; 46.32 MiB cached)

I checked the fix in #26, but the code seems to have been reverted in the current version of parallel_wavegan/bin/train.py. Any solutions?

@vjdtao This is a different issue. From iteration 100001, the discriminator is introduced into the training, so the required memory increases. The 2080 Ti has less memory than the Titan V, so an OOM error can happen. Please make batch_size or batch_max_steps a little smaller.

@kan-bayashi Thanks for the reply, I'll change the batch_size in the config file.
Should I start the training from scratch or resume from 100000 steps?
The 2080 Ti has 12G mem; how much memory does the Titan V have?

The Titan V has 12036 MB of memory.
You can resume from 100k steps if you do not care about reproducibility.

@kan-bayashi After tuning the batch_size down a bit, the training finished successfully. Thanks a lot!
Also, sorry for the incorrect memory size of the 2080 Ti in my previous comment; the correct size is 11G.

Hi @kan-bayashi, sorry to reopen this issue. I see you made some code changes after people mentioned the memory limitation of the RTX 2080 Ti (11G). Is it solved now? Could I train 400k steps with a 2080 Ti card now? Thanks.

I did not check since I do not have a 2080 Ti, but the required memory has not changed.
Please make the batch smaller if OOM happens.

thanks for your advice!