kan-bayashi/ParallelWaveGAN

Progress report

kan-bayashi opened this issue · 50 comments

This issue records the progress of training.

I have finished almost all of the implementation.
I will start the training based on the recipe.
https://github.com/kan-bayashi/ParallelWaveGAN/blob/master/egs/ljspeech/voc1/run.sh

Samples at 75k steps are available.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi

The training curve is as follows:
[Screenshot: training curve, 2019-11-02 11:32 AM]

Interestingly, the GAN loss is not applied yet, but the quality is not bad.
The WaveNet architecture and the STFT-based loss are very strong.

@kan-bayashi Looks promising. I have one vacant V100; I'm definitely going to start training this weekend.

Hi @rishikksh20.
That is nice.
If you find any important points, please share them with me.

Added samples at 100k and 130k steps.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi
It seems the quality is improving gradually.

@kan-bayashi Yes, the sound quality has improved a lot; without headphones it's hard to judge. I have also started a fresh training run to determine several parameters for a TTS baseline.
What is the inference speed? Is it real time?

Added samples at 170k steps.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi

@rishikksh20 The speed is amazing. I think this is fast enough.

2019-11-03 09:07:03,286 (decode:89) INFO: the number of features to be decoded = 250.
2019-11-03 09:07:10,390 (decode:103) INFO: loaded model parameters from exp/train_ljspeech_parallel_wavegan.v1/checkpoint-175000steps.pkl.
[decode]: 100%|██████████| 250/250 [00:30<00:00,  8.31it/s, RTF=0.0156]
2019-11-03 09:07:40,480 (decode:127) INFO: finished generation of 250 utterances (RTF = 0.016).
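
For reference, the RTF above is just the wall-clock generation time divided by the duration of the generated audio. A minimal sketch of that calculation (not the decode script's exact code; 22050 Hz is assumed as the LJSpeech sampling rate):

# Minimal sketch of a real-time-factor (RTF) calculation (not the decode script's exact code).
import time

def real_time_factor(generate, inputs, sampling_rate=22050):
    """RTF = wall-clock generation time / duration of the generated audio."""
    start = time.time()
    audio = generate(inputs)  # assumed to return a 1-D waveform array
    elapsed = time.time() - start
    return elapsed / (len(audio) / sampling_rate)

# RTF < 1.0 means faster than real time; 0.016 is roughly 60x faster than real time.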

Added samples at 200k steps.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi
I think it has reached a nice quality.

I attached the training curves.
The pulse at 120k is caused by resuming; please ignore it.

[Screenshots of the training curves:]
- Spectral convergence loss
- log STFT magnitude loss
- Adversarial loss
- Generator loss (sc + mag + adv)
- Fake loss
- Real loss
- Discriminator loss (fake + real)

I made Mandarin and Japanese recipes.
I will start training them.

Added samples at 280k steps.
They have now reached almost the same quality as the official samples.
https://r9y9.github.io/demos/projects/icassp2020/

Added jsut and csmsc intermediate samples.

I added an arctic recipe (#24).
If it is possible to train with only 1000 utterances, that is great news for voice conversion.
I will try to train it.

@kan-bayashi The original model is 1.4MB, why is the model size 16.7MB now?

@xzm2004260
You mean the number of parameters? (It is not 1.4MB but 1.4M.)
This is because the checkpoint includes all of the states of the models, optimizers, and schedulers.
You can extract only the generator parameters as follows:

import torch

# Load the full training checkpoint and keep only the generator's weights.
states = torch.load(checkpoint_path, map_location="cpu")["model"]["generator"]
torch.save(states, "test.pkl")
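
If you want to load the extracted weights back for inference, a rough sketch could look like the following (double-check the class name, default arguments, and remove_weight_norm() call against your installed version and config):

# Rough sketch: load the extracted generator weights for inference.
# Check the import path and default hyperparameters against your own version/config.
import torch
from parallel_wavegan.models import ParallelWaveGANGenerator

generator = ParallelWaveGANGenerator()  # assumes the default v1 architecture
generator.load_state_dict(torch.load("test.pkl", map_location="cpu"))
generator.remove_weight_norm()  # weight normalization is only needed during training
generator.eval()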

I am comparing two models: one follows the paper's suggestion to start training the discriminator from 100k iterations, and the other trains it from the first step.
[Screenshot: training-curve comparison, 2019-11-04 7:47 PM]
From the training curves, the discriminator can be trained from the first step.
In terms of log STFT magnitude loss, the model trained from the first step is better, while the spectral convergence loss is almost the same.

@kan-bayashi How can I train the model on a specific GPU? And will multi-GPU training be supported?

@kobenaxie You can specify the GPU via CUDA_VISIBLE_DEVICES, e.g.

CUDA_VISIBLE_DEVICES=1 ./run.sh --stage 2

Multi-GPU training could be supported, but currently I do not feel it is necessary because training fits on a single 12 GB GPU and finishes within ~3 days.

@kan-bayashi Thank you very much for your reply~

I added how to specify the GPU to the README.
https://github.com/kan-bayashi/ParallelWaveGAN#run

I finished the initial training and made a demo page to compare the quality with the official samples.
https://kan-bayashi.github.io/ParallelWaveGAN/

I added intermediate results of jsut, csmsc, and arctic.

I ran this repo with LJSpeech on a 2080 Ti GPU. When the code reaches 1000 steps, it raises an error.
[screenshot of the error]
Could you give me some advice on how to fix this?

@MorganCZY
The easiest way is to use a slightly smaller batch_max_steps.
(Make sure it is a multiple of hop_size.)
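
For example, you can round a smaller value down to a valid multiple like this (the numbers below assume the LJSpeech parallel_wavegan.v1 config, i.e. batch_max_steps=25600 and hop_size=256; check them against your own config):

# Round a desired batch_max_steps down to a multiple of hop_size.
# The values here are assumptions from the LJSpeech v1 config; adjust to yours.
hop_size = 256
desired = 20000  # anything smaller than the default 25600
batch_max_steps = (desired // hop_size) * hop_size  # round down to a valid multiple
assert batch_max_steps % hop_size == 0
print(batch_max_steps)  # -> 19968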

Could you give me the exact GPU memory size of the 2080 Ti?
In my case, training requires 11613 MiB of memory, so I think 12 GB is enough.

I found a bug in the evaluation.
I forgot to add torch.no_grad() in the evaluation loop, which caused additional memory consumption.
I will fix it.
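
The gist of the fix is just wrapping the evaluation step in torch.no_grad(); a minimal sketch (not the repo's exact training code, with a simplified batch layout):

# Minimal sketch of the fix (not the repo's exact code): run evaluation under
# torch.no_grad() so no autograd graph is kept and GPU memory stays low.
import torch

def eval_step(model, batch, device="cuda"):
    model.eval()
    with torch.no_grad():
        z, c = batch  # noise and conditioning features (simplified batch layout)
        y_hat = model(z.to(device), c.to(device))
    model.train()
    return y_hat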

[screenshot of nvidia-smi output]
I used GPU 0, shown in the picture above. During training the memory is sufficient, but after 1000 steps it enters the evaluation process and the error occurs.

That explains my errors.

After the fix (#26), the GPU memory usage decreased from 11613 to 11151 MiB.

@kan-bayashi The error is fixed with #26! Thx~

I summarized the comparison from #1 (comment) in #27, and found an issue with the discriminator.

I uploaded the initial model of csmsc and jsut.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi

I updated the demo page.

I finished arctic training.
Surprisingly, Parallel WaveGAN can be trained with only 988 utterances.
https://drive.google.com/open?id=1sd_QzcUNnbiaWq7L0ykMP7Xmk-zOuxTi

Thanks to @erogol, single-node multi-process distributed multi-GPU training is now supported (#30, #31).
If you want to try it, please follow the instructions in the README.
https://github.com/kan-bayashi/ParallelWaveGAN#run

Now I think the single-speaker model works well for various datasets. I will consider implementing a multi-speaker model.

I made a real-time E2E-TTS demonstration notebook.
You can try it online with Google Colab.
English and Japanese models are available!
https://colab.research.google.com/gist/kan-bayashi/bc65baf72faaebe5efd601b013e07342/e2e-tts-demo.ipynb
I will add a Mandarin example tomorrow.

@kan-bayashi Is it possible to train on GTA (ground-truth-aligned) features from espnet's tacotron or fastspeech?

@rishikksh20 It can be. Just replacing the normalized feats with the generated ones is OK.
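
Something like this hypothetical sketch, for example (the dump directory layout and the <utt_id>-feats.npy naming are assumptions, so adapt them to however your features are dumped):

# Hypothetical sketch: overwrite each utterance's normalized mel spectrogram with the
# acoustic model's teacher-forced (GTA) prediction, so the vocoder trains on the same
# features it will see at synthesis time. Paths and file names are assumptions.
import numpy as np

def replace_with_gta(gta_mels, dump_dir="dump/train_nodev/norm"):
    for utt_id, mel in gta_mels.items():  # mel: (num_frames, num_mels) float array
        np.save(f"{dump_dir}/{utt_id}-feats.npy", mel.astype(np.float32))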

I updated the notebook so you can select tacotron2, transformer, or fastspeech!
https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb

I added a Colab link. You can enjoy real-time synthesis!

I added the Mandarin TTS example in the notebook.
It seems to be working, but the text frontend part may have some bugs because I cannot understand Mandarin...

This implementation has become stable.
I will close this issue.

@kan-bayashi I cloned the latest git master version, ran ljspeech with parallel_wavegan.v1.yaml on a single 2080 Ti (12G mem), and got the following error, very similar to @MorganCZY's.

[train]: 25%|██▌ | 100001/400000 [14:26:31<239:27:09, 2.87s/it]
Traceback (most recent call last):
  File "/opt/speech/tools/ParalleWaveGAN/tools/venv/bin/parallel-wavegan-train", line 11, in <module>
    load_entry_point('parallel-wavegan', 'console_scripts', 'parallel-wavegan-train')()
  File "/opt/speech/tools/ParalleWaveGAN/parallel_wavegan/bin/train.py", line 760, in main
    trainer.run()
  File "/opt/speech/tools/ParalleWaveGAN/parallel_wavegan/bin/train.py", line 87, in run
    self._train_epoch()
  File "/opt/speech/tools/ParalleWaveGAN/parallel_wavegan/bin/train.py", line 260, in _train_epoch
    self._train_step(batch)
  File "/opt/speech/tools/ParalleWaveGAN/parallel_wavegan/bin/train.py", line 202, in _train_step
    gen_loss.backward()
  File "/opt/speech/tools/ParalleWaveGAN/tools/venv/lib/python3.6/site-packages/torch/tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/speech/tools/ParalleWaveGAN/tools/venv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 38.00 MiB (GPU 0; 10.76 GiB total capacity; 9.51 GiB already allocated; 28.69 MiB free; 46.32 MiB cached)

I checked the fix in #26, but the code seems to have been reverted in the current version of parallel_wavegan/bin/train.py. Any solutions?

@vjdtao This is a different issue. From iteration 100001, the discriminator is introduced into the training, so the required memory increases. The 2080 Ti has less memory than the Titan V, so an OOM error can happen. Please make batch_size or batch_max_steps a little smaller.

@kan-bayashi Thanks for the reply, I'll change the batch_size in the config file.
Should I start the training from scratch or resume from 100000 steps?
The 2080 Ti has 12G mem; how much memory does the Titan V have?

The Titan V has 12036 MB of memory.
You can resume from 100k steps if you do not care about reproducibility.

@kan-bayashi After tuning the batch_size down a bit, the training finished successfully. Thanks a lot!
Also, sorry for the incorrect memory size of the 2080 Ti in my previous comment; the correct size is 11G.

Hi @kan-bayashi, sorry to reopen this issue. I see you made some code changes after people mentioned the memory limitation of the RTX 2080 Ti (11G). Is it solved now? Could I train 400k steps with a 2080 Ti card now? Thanks.

I did not check since I do not have a 2080 Ti, but the required memory has not changed.
Please make the batch smaller if OOM happens.

thanks for your advice!