ivanvovk/WaveGrad

Matplotlib API change & NaNs for short clips & new hop_length

thorstenMueller opened this issue · 27 comments

I'm trying to run training on an NVIDIA Xavier AGX device, running an NVIDIA Docker container set up according to these instructions: https://ngc.nvidia.com/catalog/containers/nvidia:l4t-pytorch.

But I receive the following error:

Initializing logger...
Initializing model...
Number of parameters: 15810401
Initializing optimizer, scheduler and losses...
Initializing data loaders...

Traceback (most recent call last):
  File "train.py", line 185, in <module>
    run(config, args)
  File "train.py", line 72, in run
    logger.log_specs(0, specs)
  File "/media/908f901d-e80b-4a8e-8a16-9e0f1b896732/TTS/thorsten-de/models/model-v02/WaveGrad/logger.py", line 53, in log_specs
    self.add_image(key, plot_tensor_to_numpy(image), iteration, dataformats='HWC')
  File "/media/908f901d-e80b-4a8e-8a16-9e0f1b896732/TTS/thorsten-de/models/model-v02/WaveGrad/utils.py", line 66, in plot_tensor_to_numpy
    im = ax.imshow(tensor, aspect="auto", origin="bottom", interpolation='none', cmap='hot')
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/__init__.py", line 1438, in inner
    return func(ax, *map(sanitize_sequence, args), **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/axes/_axes.py", line 5521, in imshow
    resample=resample, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/image.py", line 905, in __init__
    **kwargs
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/image.py", line 246, in __init__
    cbook._check_in_list(["upper", "lower"], origin=origin)
  File "/usr/local/lib/python3.6/dist-packages/matplotlib/cbook/__init__.py", line 2257, in _check_in_list
    .format(v, k, ', '.join(map(repr, values))))
ValueError: 'bottom' is not a valid value for origin; supported values are 'upper', 'lower'

python3 -V: Python 3.6.9
pip3 -V: 20.2.3

Running pip3 list shows the following installed packages:

absl-py (0.10.0)
appdirs (1.4.4)
cachetools (4.1.1)
certifi (2020.6.20)
chardet (3.0.4)
cycler (0.10.0)
Cython (0.29.20)
decorator (4.4.2)
future (0.18.2)
google-auth (1.22.1)
google-auth-oauthlib (0.4.1)
grpcio (1.32.0)
idna (2.10)
importlib-metadata (2.0.0)
kiwisolver (1.2.0)
Mako (1.1.3)
Markdown (3.3)
MarkupSafe (1.1.1)
matplotlib (3.3.1)
numpy (1.19.0)
oauthlib (3.1.0)
Pillow (7.2.0)
pip (9.0.1)
protobuf (3.13.0)
pyasn1 (0.4.8)
pyasn1-modules (0.2.8)
pycuda (2019.1.2)
pyparsing (2.4.7)
python-dateutil (2.8.1)
pytools (2020.3.1)
requests (2.24.0)
requests-oauthlib (1.3.0)
rsa (4.6)
setuptools (50.3.0)
six (1.15.0)
tensorboard (2.3.0)
tensorboard-plugin-wit (1.7.0)
torch (1.6.0)
torchaudio (0.6.0a0+d6f81d1)
torchvision (0.7.0a0+6631b74)
tqdm (4.50.2)
urllib3 (1.25.10)
Werkzeug (1.0.1)
wheel (0.35.1)
zipp (3.3.0)

I tried matplotlib 3.3.1 and 3.3.2, both with the same result.

Any ideas what I'm missing?
Thank you.

Hello. That's strange. Maybe they changed the API in the latest versions. The matplotlib version I am using is 3.2.1 and it's OK. Try two things:

  1. Change the value of the keyword argument origin in this line from "bottom" to "lower". I guess it should do the same.
  2. If the first step doesn't help, try downgrading to version 3.2.1.
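For reference, a sketch of the one-line fix in utils.py (plot_tensor_to_numpy), shown here as a self-contained example rather than the repo's exact code:

```python
# Matplotlib 3.3 only accepts "upper"/"lower" for `origin`, while 3.2.x still
# accepted the old "bottom" alias; replacing the value fixes the ValueError.
import matplotlib.pyplot as plt
import numpy as np

tensor = np.random.rand(80, 200)  # stand-in for a mel spectrogram
fig, ax = plt.subplots()
im = ax.imshow(tensor, aspect="auto", origin="lower",  # was origin="bottom"
               interpolation='none', cmap='hot')
```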

Thanks for your quick support.
Changing "bottom" to "lower" worked for me. I've made a mini pr - hopefully it's helpful :-).

Training went well (reproducibly) until iteration 55. Then it runs into problems calculating the loss stats.

Iteration: 52 | Losses: [15.780592918395996, 821.0237426757812]
Iteration: 53 | Losses: [4.594686985015869, 205.12646484375]
Iteration: 54 | Losses: [2.3868210315704346, 97.16974639892578]
Iteration: 55 | Losses: [1.1524507999420166, 78.44384002685547]
Iteration: 56 | Losses: [nan, nan]
Iteration: 57 | Losses: [nan, nan]
Iteration: 58 | Losses: [nan, nan]

Any idea on that?
Maybe I'll try to compile matplotlib 3.2.1 and run with the original "bottom" code.

No, it is not connected to matplotlib. It is the loss explosion problem, which occurs sometimes. Try setting lr to 5e-4 and scheduler_gamma to 0.9 in the config, as mentioned in issue #3.
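For reference, the two values in question would be changed roughly like this (shown as a Python dict fragment; the actual edits go into the JSON config file):

```python
# Suggested training_config values from issue #3 (sketch):
training_config_overrides = {
    "lr": 5e-4,             # default config uses 1e-4
    "scheduler_gamma": 0.9,
}
```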

Thanks for your reply, and sorry that I hadn't seen this existing helpful issue before.
Sadly, setting lr to 5e-4 and scheduler_gamma to 0.9 didn't change anything.

After reducing the batch size, the NaN problem occurs later:

  • Default batch size 48: NaN at step 56
  • Batch size 32: NaN at step 86
  • Batch size 16: NaN at step 184

Is there a better place than this issue for discussing best-practice configs?

@thorstenMueller, I am planning to push a new version of WaveGrad soon, which should be more robust to the loss explosion problem. Please check it in a few days.

Thanks, sounds good.
I'll test again as soon as you've pushed a new version.

@thorstenMueller Hello, sorry for being a bit late. I have updated the repo. I believe it should be more robust to the loss explosion issue now.

Hey @ivanvovk .

Thanks for the huge update 👍. I've set up a training run with my available German dataset (https://github.com/thorstenMueller/deep-learning-german-tts/) and training has been running for a day without stopping due to errors.
[Screenshot: WaveGrad training running]

But I could use some help understanding its progress. Do you have an account on Mozilla Discourse so we could discuss my questions there (https://discourse.mozilla.org/t/contributing-my-german-voice-for-tts/) and not blow up this issue?

The following things are on my mind right now:

  1. Should the tensorboard dependency be added to requirements.txt?

  2. When is the best time to run the notebook (12 .pt checkpoint files have been written so far)?
    Should the current training run be finished before running the notebook?

  3. The audio samples are pure random noise, and the predicted graphs don't change.

  4. I see lots of NaN points (triangles) in the TensorBoard graphs (the grad norm graph).

[Screenshots: TensorBoard images, scalars, and grad norm]

This is the WaveGrad config I'm using.

[Screenshots: WaveGrad config, parts 1 and 2]

The Taco2 training is based on "hop_length": 256, so I'll need to adjust "factors" in the config. Currently the WaveGrad training uses hop_length = 300.
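As a side note (my own sketch, not from the repo docs): the product of the upsampling factors has to match hop_length, since the model upsamples mel frames back to waveform resolution; the default factors [5, 5, 3, 2, 2] multiply to 300. For hop_length = 256, one possible (hypothetical) choice is:

```python
# Sanity check: the chosen upsampling factors must multiply to the hop length.
from functools import reduce
from operator import mul

hop_length = 256
factors = [4, 4, 4, 2, 2]  # product = 256; other factorizations of 256 work too

assert reduce(mul, factors) == hop_length, "factors must multiply to hop_length"
```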

It would be great if you could support me on this :-).
Thanks so far.

Sorry, I have no account there. Write to me at iyuvovk@yandex.ru and we'll decide where to continue the discussion.

Okay, I've sent you an email.
If it's okay with you, we can communicate publicly within this issue.

Hey @ivanvovk .
I've got an error during training, at epoch 19.

100%|#####################################################################################################################################| 97/97 [00:27<00:00,  3.48it/s]
Device: GPU. average_rtf=4.038705106806669
Epoch: 18 | Losses: [0.49297845363616943, 0.07294661551713943, 2.0598750710487366]
100%|#####################################################################################################################################| 97/97 [00:27<00:00,  3.48it/s]
Device: GPU. average_rtf=4.033360881088442
Epoch: 19 | Losses: [nan, nan, nan]
/usr/local/lib/python3.6/dist-packages/torch/utils/tensorboard/summary.py:422: RuntimeWarning: invalid value encountered in greater
  if abs(tensor).max() > 1:
Traceback (most recent call last):
  File "train.py", line 262, in <module>
    run_training(0, config, args)
  File "train.py", line 198, in run_training
    logger.log_audios(epoch, audios)
  File "/wavegrad/logger.py", line 55, in log_audios
    self.summary_writer.add_audio(key, audio, iteration, sample_rate=self.sample_rate)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/tensorboard/writer.py", line 676, in add_audio
    audio(tag, snd_tensor, sample_rate=sample_rate), global_step, walltime)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/tensorboard/summary.py", line 427, in audio
    tensor_list = [int(32767.0 * x) for x in tensor]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/tensorboard/summary.py", line 427, in <listcomp>
    tensor_list = [int(32767.0 * x) for x in tensor]
ValueError: cannot convert float NaN to integer
Segmentation fault (core dumped)

TensorBoard was running while this error occurred. Is a running TB a problem?
Do you have any idea what might be the reason or do you need more info from me?

I have the same error when training on a Vietnamese dataset.

Device: GPU. average_rtf=0.39042793247007823
Epoch: 17 | Losses: [nan, 0.49476008117198944, 3.333707571029663]
Device: GPU. average_rtf=0.43076214979387495
Epoch: 18 | Losses: [nan, nan, nan]
Traceback (most recent call last):
  File "train.py", line 264, in <module>
    run_training(0, config, args)
  File "train.py", line 198, in run_training
    logger.log_audios(epoch, audios)
  File "/data/cuongnm5/WaveGrad/logger.py", line 55, in log_audios
    self.summary_writer.add_audio(key, audio, iteration, sample_rate=self.sample_rate)
  File "/root/miniconda3/envs/cuongnm/lib/python3.7/site-packages/torch/utils/tensorboard/writer.py", line 676, in add_audio
    audio(tag, snd_tensor, sample_rate=sample_rate), global_step, walltime)
  File "/root/miniconda3/envs/cuongnm/lib/python3.7/site-packages/torch/utils/tensorboard/summary.py", line 427, in audio
    tensor_list = [int(32767.0 * x) for x in tensor]
  File "/root/miniconda3/envs/cuongnm/lib/python3.7/site-packages/torch/utils/tensorboard/summary.py", line 427, in <listcomp>
    tensor_list = [int(32767.0 * x) for x in tensor]
ValueError: cannot convert float NaN to integer

My config:

"batch_size": 96,
"segment_length": 7200,
"lr": 5e-4,
"grad_clip_threshold": 1,
"scheduler_step_size": 1,
"scheduler_gamma": 0.9,
"n_epoch": 10000,
"n_samples_to_test": 4,
"test_interval": 1

@dodoproptit99 Hello. Okay, that seems like a problem with PyTorch mixed-precision training. I've just pushed a small update to the repo, where I added support for turning it off. Please pull the new version, disable fp16 training here, and decrease the batch size (I suggest 48). I suppose it should help. And please report whether it helps or not.
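For reference, the relevant keys (as they appear in the config dumps quoted below) would be set roughly like this; the actual edits go into the JSON config file:

```python
# Sketch of the suggested change:
training_config_overrides = {
    "use_fp16": False,  # disable mixed-precision training
    "batch_size": 48,   # down from 96
}
```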

@ivanvovk Thanks for your reply! I tried decreasing the batch size to 48 and 24 and disabling fp16 training, but I still get this error :(
I'm using an RTX 2080 Ti with CUDA 10.2.

[Screenshot from 2020-11-01 14-43-38]

@dodoproptit99 what tensorboard output do you have?

@ivanvovk

  • logs/default_2:

{"model_config": {"factors": [5, 5, 3, 2, 2], "upsampling_preconv_out_channels": 768, "upsampling_out_channels": [512, 512, 256, 128, 128], "upsampling_dilations": [[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4, 8]], "downsampling_preconv_out_channels": 32, "downsampling_out_channels": [128, 128, 256, 512], "downsampling_dilations": [[1, 2, 4], [1, 2, 4], [1, 2, 4], [1, 2, 4]]}, "data_config": {"sample_rate": 16000, "n_fft": 1024, "win_length": 1024, "hop_length": 300, "f_min": 80.0, "f_max": 8000, "n_mels": 80}, "training_config": {"logdir": "logs/default_2", "continue_training": false, "train_filelist_path": "filelists/train.txt", "test_filelist_path": "filelists/test.txt", "batch_size": 48, "segment_length": 7200, "lr": 0.0001, "grad_clip_threshold": 1, "scheduler_step_size": 1, "scheduler_gamma": 0.9, "n_epoch": 10000, "n_samples_to_test": 4, "test_interval": 1, "use_fp16": false, "training_noise_schedule": {"n_iter": 1000, "betas_range": [1e-06, 0.01]}, "test_noise_schedule": {"n_iter": 50, "betas_range": [1e-06, 0.01]}}, "dist_config": {"MASTER_ADDR": "localhost", "MASTER_PORT": "600010"}}

  • logs/default_3:

{"model_config": {"factors": [5, 5, 3, 2, 2], "upsampling_preconv_out_channels": 768, "upsampling_out_channels": [512, 512, 256, 128, 128], "upsampling_dilations": [[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4, 8]], "downsampling_preconv_out_channels": 32, "downsampling_out_channels": [128, 128, 256, 512], "downsampling_dilations": [[1, 2, 4], [1, 2, 4], [1, 2, 4], [1, 2, 4]]}, "data_config": {"sample_rate": 16000, "n_fft": 1024, "win_length": 1024, "hop_length": 300, "f_min": 80.0, "f_max": 8000, "n_mels": 80}, "training_config": {"logdir": "logs/default_3", "continue_training": false, "train_filelist_path": "filelists/train.txt", "test_filelist_path": "filelists/test.txt", "batch_size": 24, "segment_length": 7200, "lr": 0.0001, "grad_clip_threshold": 1, "scheduler_step_size": 1, "scheduler_gamma": 0.9, "n_epoch": 10000, "n_samples_to_test": 4, "test_interval": 1, "use_fp16": false, "training_noise_schedule": {"n_iter": 1000, "betas_range": [1e-06, 0.01]}, "test_noise_schedule": {"n_iter": 50, "betas_range": [1e-06, 0.01]}}, "dist_config": {"MASTER_ADDR": "localhost", "MASTER_PORT": "600010"}}

  • logs/default:

{"model_config": {"factors": [5, 5, 3, 2, 2], "upsampling_preconv_out_channels": 768, "upsampling_out_channels": [512, 512, 256, 128, 128], "upsampling_dilations": [[1, 2, 1, 2], [1, 2, 1, 2], [1, 2, 4, 8], [1, 2, 4, 8], [1, 2, 4, 8]], "downsampling_preconv_out_channels": 32, "downsampling_out_channels": [128, 128, 256, 512], "downsampling_dilations": [[1, 2, 4], [1, 2, 4], [1, 2, 4], [1, 2, 4]]}, "data_config": {"sample_rate": 16000, "n_fft": 1024, "win_length": 1024, "hop_length": 300, "f_min": 80.0, "f_max": 8000, "n_mels": 80}, "training_config": {"logdir": "logs/default", "continue_training": false, "train_filelist_path": "filelists/train.txt", "test_filelist_path": "filelists/test.txt", "batch_size": 96, "segment_length": 7200, "lr": 0.0001, "grad_clip_threshold": 1, "scheduler_step_size": 1, "scheduler_gamma": 0.9, "n_epoch": 10000, "n_samples_to_test": 4, "test_interval": 1, "use_fp16": true, "training_noise_schedule": {"n_iter": 1000, "betas_range": [1e-06, 0.01]}, "test_noise_schedule": {"n_iter": 50, "betas_range": [1e-06, 0.01]}}, "dist_config": {"MASTER_ADDR": "localhost", "MASTER_PORT": "600010"}}

[Screenshots from 2020-11-01 17-15-32 and 17-15-19]

@dodoproptit99 this is really strange. Can you please run the following script? Put it in the root folder of WaveGrad and run python check_data.py -c configs/YOUR_CONFIG -f filelists/YOUR_FILELIST. It will check whether the mel transformation produces bad values (infs or NaNs). Of course, don't forget to specify your GPU device through CUDA_VISIBLE_DEVICES.
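The script itself isn't reproduced in this thread; a rough sketch of such a check (not the original check_data.py, assuming the JSON config layout shown in the dumps below and a filelist with one wav path per line) might look like this:

```python
# check_data.py (sketch): flag NaNs/infs in the log-mel features of a dataset.
import argparse
import json

import torch
import torchaudio


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-c', '--config', required=True)
    parser.add_argument('-f', '--filelist', required=True)
    args = parser.parse_args()

    with open(args.config) as f:
        data_config = json.load(f)['data_config']

    # Same mel parameters as used for training
    mel_fn = torchaudio.transforms.MelSpectrogram(
        sample_rate=data_config['sample_rate'],
        n_fft=data_config['n_fft'],
        win_length=data_config['win_length'],
        hop_length=data_config['hop_length'],
        f_min=data_config['f_min'],
        f_max=data_config['f_max'],
        n_mels=data_config['n_mels'],
    )

    with open(args.filelist) as f:
        paths = [line.strip() for line in f if line.strip()]

    has_nans, has_infs = False, False
    for path in paths:
        audio, _ = torchaudio.load(path)
        # The training code takes log10 of the mel energies; zeros become -inf
        logmel = torch.log10(mel_fn(audio))
        has_nans |= bool(torch.isnan(logmel).any())
        has_infs |= bool(torch.isinf(logmel).any())

    print(f'Dataset has nans: {has_nans}')
    print(f'Dataset has infs: {has_infs}')


if __name__ == '__main__':
    main()
```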

This is my output:
Dataset has nans: False
Dataset has infs: True

Can you tell me more about that?
Thanks in advance ^^

@dodoproptit99 Okay, I found the origin of the problem. It seems your data contains audio clips shorter than segment_length (=7200 in the default configuration). For such cases, proper batching is achieved by padding with zeros. When transforming to a mel-spectrogram, I take log10, which produces infinity values on the zero padding. I have just pushed an update which should solve this problem. Check it out by pulling the latest repo changes, and please report back.
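The exact change that was pushed isn't shown here; a minimal sketch of the general idea (clamp the mel energies away from zero before taking log10) could be:

```python
import torch

def safe_log_mel(mel: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # mel: output of the mel transform; zero-padded regions are exactly 0,
    # and log10(0) = -inf, so clamp to a small positive floor first.
    return torch.log10(torch.clamp(mel, min=eps))
```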

@ivanvovk It works ^^

[Screenshot from 2020-11-02 17-02-04]

@dodoproptit99 glad to hear that! @thorstenMueller also check it out; it will probably solve your problem with NaNs too (if it's still relevant).

Thanks @ivanvovk for pinging me and updating the code.
I have no problems with NaNs currently. I'm having trouble with this:

Initializing logger...
Initializing model...
Number of WaveGrad parameters: 15810401
Initializing optimizer, scheduler and losses...
Initializing data loaders...
Start training...
Traceback (most recent call last):                                                                                                                                        
  File "train.py", line 262, in <module>
    run_training(0, config, args)
  File "train.py", line 117, in run_training
    loss = (model if args.n_gpus == 1 else model.module).compute_loss(mels, batch)
  File "/wavegrad/model/diffusion_process.py", line 176, in compute_loss
    eps_recon = self.nn(mels, y_noisy, continuous_sqrt_alpha_cumprod)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/wavegrad/model/nn.py", line 119, in forward
    ublock_outputs = ublock(x=ublock_outputs, scale=scale, shift=shift)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/wavegrad/model/upsampling.py", line 82, in forward
    outputs = self.first_block_main_branch['modulation'](outputs, scale, shift)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/wavegrad/model/upsampling.py", line 30, in forward
    outputs = self.featurewise_affine(x, scale, shift)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/wavegrad/model/linear_modulation.py", line 68, in forward
    outputs = scale * x + shift
RuntimeError: The size of tensor a (450) must match the size of tensor b (448) at non-singleton dimension 2
Segmentation fault (core dumped)

I'd like to try your tip, but I'm not sure how to do this:

New hop length, new struggles. Check whether the mel spectrogram shape you obtain corresponds to the audio length or not. Take an audio clip and convert it using this class. The mel length multiplied by 256 should be exactly equal to the audio length.

@thorstenMueller oh, sorry, I see what's wrong. Besides the upsampling factors you also need to update the segment length, which should be divisible by the hop length. Change segment_length in your config to 7168 (= 28 × 256), for example, and it will work.
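For reference, a rough sketch of the check from the quoted tip plus the divisibility constraint just mentioned. Plain torchaudio is used here for illustration only; the repo's own mel class should be used for the real check, and the file path is hypothetical:

```python
import torchaudio

hop_length = 256
segment_length = 7168
assert segment_length % hop_length == 0  # 7168 = 28 * 256

audio, sr = torchaudio.load("example.wav")  # hypothetical test clip
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, win_length=1024,
    hop_length=hop_length, n_mels=80,
)(audio)

# According to the tip, mel frames * hop_length should equal the audio length
# exactly; if these differ, the transform's padding/centering needs adjusting.
print("mel frames * hop_length:", mel.shape[-1] * hop_length)
print("audio length:           ", audio.shape[-1])
```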

Thanks @ivanvovk .
I'll give it a try soon and report back to you.

Hey @ivanvovk .
Training has been running for 12 hours without any problems (epoch 10).
The graphs look good and the audio samples sound good.

The next step will be checking whether the generated mel spectrograms are compatible with the Mozilla TTS project.

So thanks for your support and updates on this 👍 .
[Screenshots: TensorBoard scalars and mel spectrograms]

@thorstenMueller glad that it works and you're welcome! Closing this issue.