p0p4k/vits2_pytorch

Training code error

WendongGan opened this issue · 30 comments

Hi p0p4k, thanks for sharing the code. It's a great project. I have been following it for a long time and have tried it many times, but when I run the training code I still get the following error. I guess some parameters are passed incorrectly in the code; the actual parameters are not fully taken from vits2_ljs_base.json. I tried to debug and modify it, but it didn't work. Looking forward to your review and reply.

When I run:
python train.py -c configs/vits2_ljs_base.json -m ljs_base

I get the following error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 157, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 191, in train_and_evaluate
(z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/models.py", line 748, in forward
z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/models.py", line 495, in forward
x = self.pre(x) * x_mask
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 80, 1], expected input[32, 513, 298] to have 80 channels, but got 513 channels instead

If any of you have solved this problem, I look forward to sharing your solutions. Thank you very much!

p0p4k commented

Hello, I made a really silly mistake. Please try the latest patch and let me know.
In train.py, I was supposed to modify hps.data.use_mel_posterior_encoder based on hps.model.use_mel_posterior_encoder before passing hps.data to the dataloader.
However, I created the dataloader first, which generates linear spectrograms of 513 channels, while the model parameters load a model that expects mel spectrograms of 80 channels; only after that did I modify the hps.data params (which were never used, since the dataloader was already loaded).
I fixed the order and also added an additional flag in hps.data just to be sure for now. I will do a cleanup later on to avoid model/data parameter mismatches (minor stuff).
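In code, the intended order looks roughly like the self-contained sketch below; the hps field names (use_mel_posterior_encoder, n_mel_channels, filter_length) and the TextAudioLoader reference are assumptions based on the VITS-style layout, not the exact patch.

from types import SimpleNamespace

# Hypothetical stand-in for hps loaded from vits2_ljs_base.json.
hps = SimpleNamespace(
    model=SimpleNamespace(use_mel_posterior_encoder=True),
    data=SimpleNamespace(use_mel_posterior_encoder=False, n_mel_channels=80, filter_length=1024),
)

# The fix: sync the flag into hps.data BEFORE constructing the dataloader,
# so the dataset returns 80-channel mel-specs instead of 513-channel linear specs.
hps.data.use_mel_posterior_encoder = hps.model.use_mel_posterior_encoder

posterior_channels = (
    hps.data.n_mel_channels                  # 80 for the mel posterior encoder
    if hps.data.use_mel_posterior_encoder
    else hps.data.filter_length // 2 + 1     # 513 for the original linear-spec encoder
)
print(posterior_channels)  # 80

# Only now build the dataset, e.g. TextAudioLoader(hps.data.training_files, hps.data).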
Thanks.

Thank you very much. I will try this latest code and report back the results.

p0p4k commented

@UESTCgan wait, I think there is still a minor bug. Fixing it now.
Fixed multi-speaker loader as well. Should be good to go.

I tried the latest code, committed an hour ago: ca7e41d.

When I run:
python train.py -c configs/vits2_ljs_base.json -m ljs_base

I get the following error:

Traceback (most recent call last):
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 338, in
main()
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 158, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 194, in train_and_evaluate
mel = spec_to_mel_torch(
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/mel_processing.py", line 85, in spec_to_mel_torch
spec = torch.matmul(mel_basis[fmax_dtype_device], spec)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (9536x80 and 513x80)

p0p4k commented

Haha, of course. I am making so many silly mistakes. Fixing it right now.

p0p4k commented

Fixed. @UESTCgan, thanks a lot for letting me know about the errors. This feedback is really helpful!
Let's get the model working ASAP!

p0p4k commented

Explanation of the bug:
After generating the wav output (wav_pred), the model converts wav_pred to a mel-spec to compare against the mel-spec of wav_real. In the VITS-1 model, that mel-spec is obtained from the lin-spec used as the model input.
However, in VITS-2 we feed the model a mel-spec, so the bug came from trying to convert a mel-spec into a mel-spec. We must instead use the mel-spec that was input to the model directly and compare it with wav_pred's mel-spec.
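A minimal sketch of that idea, kept self-contained (the helper and shapes are illustrative; in the repo the conversion would be mel_processing.spec_to_mel_torch inside train_and_evaluate):

import torch

def target_mel(spec: torch.Tensor, use_mel_posterior_encoder: bool, spec_to_mel) -> torch.Tensor:
    """Pick the ground-truth mel for the generator loss.
    spec is [B, C, T]: C == 80 if the dataloader already returns mel-specs,
    C == n_fft // 2 + 1 (513) if it returns linear specs."""
    if use_mel_posterior_encoder:
        return spec           # already a mel-spec; converting it again caused the shape error
    return spec_to_mel(spec)  # VITS-1 path: linear spec -> mel-spec

# Toy check with the shapes from the traceback, using a dummy converter:
lin_to_mel = lambda s: torch.randn(s.size(0), 80, s.size(2))
print(target_mel(torch.randn(2, 80, 100), True, lin_to_mel).shape)    # torch.Size([2, 80, 100])
print(target_mel(torch.randn(2, 513, 100), False, lin_to_mel).shape)  # torch.Size([2, 80, 100])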

I tried the latest code (ee1c94d).

When I run:
python train.py -c configs/vits2_ljs_base.json -m ljs_base

I get the following error:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 158, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 192, in train_and_evaluate
(z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 776 777 778 779 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

Currently I am using pytorch==1.13. Do I have to use Pytorch version 2.0?

p0p4k commented

Can you try "use_noise_scaled_mas=False" in the config and run the training? Thanks.

When I set "use_noise_scaled_mas=False" in the config,

I get the following error:

Traceback (most recent call last):
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 344, in
main()
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 126, in run
mas_noise_scale_initial = mas_noise_scale_initial,
UnboundLocalError: local variable 'mas_noise_scale_initial' referenced before assignment
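This is the classic pattern where a variable is only assigned inside the if use_noise_scaled_mas branch but is still referenced when the model is built (the same pattern applies to noise_scale_delta). A hypothetical reconstruction of the pattern and the fix, runnable on its own; the variable names come from the tracebacks and the default values are illustrative:

# Inside run(), with the config flag turned off:
use_noise_scaled_mas = False

# Buggy shape -- only the True branch defines the values:
# if use_noise_scaled_mas:
#     mas_noise_scale_initial = 0.01
#     noise_scale_delta = 2e-6
# net_g = SynthesizerTrn(..., mas_noise_scale_initial=mas_noise_scale_initial, ...)  # UnboundLocalError

# Fix: give both variables defaults before the branch so they always exist.
mas_noise_scale_initial = 0.0
noise_scale_delta = 0.0
if use_noise_scaled_mas:
    mas_noise_scale_initial = 0.01
    noise_scale_delta = 2e-6

print(mas_noise_scale_initial, noise_scale_delta)  # 0.0 0.0 when the flag is off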

p0p4k commented

Updated. Thanks.

p0p4k commented

I am downloading data and trying to train one step and check the previous error regarding loss.

Currently I am using pytorch==1.13. Do I have to use Pytorch version 2.0?

This issue does not seem to be related to the pytorch version. I still have this problem with pytorch 2.0.

p0p4k commented

But after the latest update and use_noise_scaled_mas=False ?

But after the latest update and use_noise_scaled_mas=False ?

Traceback (most recent call last):
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 345, in
main()
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 128, in run
noise_scale_delta = noise_scale_delta,
UnboundLocalError: local variable 'noise_scale_delta' referenced before assignment

p0p4k commented

Check again. Thanks.

Check again. Thanks.

When I try 30adb2d,

I get the following error:
Traceback (most recent call last):
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 346, in
main()
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 51, in main
mp.spawn(run, nprocs=n_gpus, args=(n_gpus, hps,))
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 160, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 194, in train_and_evaluate
(z, z_p, m_p, logs_p, m_q, logs_q) = net_g(x, x_lengths, spec, spec_lengths)
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 776 777 778 779 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

p0p4k commented

Okay. Then give me 2 hours, I will fix the bug and let you know. Thanks.

Okay. Then give me 2 hours, I will fix the bug and let you know. Thanks.

Thank you very much! Looking forward to your update.

p0p4k commented

I added the find_unused_parameters flag. Tell me what the error says now; that will help me update the code. Thanks.
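For reference, that flag goes on the DDP wrapper around the model. A minimal, runnable single-process sketch follows; the real train.py wraps SynthesizerTrn with device_ids per rank, which is assumed rather than shown here.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process setup so the example runs on its own (gloo backend, CPU).
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

net_g = torch.nn.Linear(8, 8)  # stand-in for the SynthesizerTrn generator
net_g = DDP(
    net_g,
    find_unused_parameters=True,  # tolerate parameters that receive no grad (e.g. flag-gated branches)
)

loss = net_g(torch.randn(4, 8)).sum()
loss.backward()
dist.destroy_process_group()

Note that find_unused_parameters=True adds per-iteration overhead, so it is more of a stopgap than a substitute for making every sub-module's output feed the loss.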

After training for some steps (screenshot of training progress):

I get the following error:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 161, in run
train_and_evaluate(rank, epoch, hps, [net_g, net_d], [optim_g, optim_d], [scheduler_g, scheduler_d], scaler, [train_loader, eval_loader], logger, [writer, writer_eval])
File "/data7/ganwendong/vits2_pytorch/test4-0817/vits2_pytorch/train.py", line 244, in train_and_evaluate
scaler.scale(loss_gen_all).backward()
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/root/anaconda3/envs/vits2_gwd2/lib/python3.9/site-packages/torch/autograd/init.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([16, 192, 1, 123], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(192, 768, kernel_size=[1, 3], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
memory_format = Contiguous
data_type = CUDNN_DATA_HALF
padding = [0, 0, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x562a218aa350
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 192, 1, 123,
strideA = 23616, 123, 123, 1,
output: TensorDescriptor 0x7f544aefb2e0
type = CUDNN_DATA_HALF
nbDims = 4
dimA = 16, 768, 1, 121,
strideA = 92928, 121, 121, 1,
weight: FilterDescriptor 0x7f544aefaa60
type = CUDNN_DATA_HALF
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 768, 192, 1, 3,
Pointer addresses:
input: 0x7f546e400000
output: 0x7f5685917000
weight: 0x7f54676b5800

I am trying pytorch 2.0; maybe it works.

(screenshot of training progress)

I'm training on LJSpeech. If I have some results tomorrow, I'll report back. Thanks again for the update!

p0p4k commented

Hello, does the training work well now? And can you post me your config file? Do you have discord? (add me on discord : p0p4k)

p0p4k commented

Hi, I tried using Pytorch==1.13.1 and the training worked for me. I suggest using the same version.

Hello, does the training work well now? And can you post me your config file? Do you have discord? (add me on discord : p0p4k)

3c5b155
I am training with pytorch 2.0.1 and there is no problem. I did not change the config.

p0p4k commented

ok

Hello, does the training work well now? And can you post me your config file? Do you have discord? (add me on discord : p0p4k)

My discord is hepanqingge#8740, I have added you.