train_second.py model.decoder error (output tensor is nan)
The g_loss value in "train_second.py" is NaN.
Debugging shows that the output of the model.decoder() call is NaN (line 391, line 402).
There was no problem in train_first.py, so I don't know why this happens only in train_second.py.
If you know how to fix this error, please help me.
Thank you.
```yaml
log_dir: "C:\Users\user_\Desktop\styleTTS2_test_data"
first_stage_path: "first_stage.pth"
save_freq: 2
log_interval: 10
device: "cuda"
epochs_1st: 200 # number of epochs for first stage training (pre-training)
epochs_2nd: 100 # number of epochs for second stage training (joint training)
batch_size: 4
max_len: 200 # maximum number of frames
pretrained_model: "C:\Users\user_\Desktop\styleTTS2_test_data\epoch_1st_00170.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters

# config for decoder
decoder:
  type: 'istftnet' # either hifigan or istftnet
  resblock_kernel_sizes: [3,7,11]
  upsample_rates: [10, 6]
  upsample_initial_channel: 512
  resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
  upsample_kernel_sizes: [20, 12]
  gen_istft_n_fft: 20
  gen_istft_hop_size: 5
```
Same issue
I have experienced this before in a few situations:
- the actual model parameters are not being loaded from the checkpoint (there is a naming mismatch involving a "module" prefix between stages 1 and 2, depending on whether you use distributed or non-distributed training; try setting strict loading to true and see which keys are reported; see the sketch after this list)
- multispeaker is set incorrectly
- certain batch sizes with mixed precision (try changing the batch size)
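For the first point, here is a minimal sketch of how you could load the stage-1 weights with strict key checking and strip a stray `module.` prefix. The `load_strict` helper and the `"net"` checkpoint key are my assumptions; adapt them to whatever `torch.load` actually returns for your checkpoint.

```python
import torch

def load_strict(model_dict, ckpt_path):
    """Hypothetical helper: load per-module state dicts from a checkpoint,
    strip any (Distributed)DataParallel "module." prefix, and load with
    strict=True so missing/unexpected keys raise immediately instead of
    silently leaving parts of the model randomly initialized."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Assumption: the checkpoint stores one state dict per sub-model under "net".
    params = ckpt.get("net", ckpt)
    for name, model in model_dict.items():
        state = params[name]
        # Drop the "module." prefix if the checkpoint was saved from a DDP-wrapped model.
        state = {k[len("module."):] if k.startswith("module.") else k: v
                 for k, v in state.items()}
        model.load_state_dict(state, strict=True)  # raises RuntimeError on key mismatch

# Usage (assumed): load_strict(model, config['pretrained_model'])
```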
Have you checked whether `F0_fake`, `N_fake`, `s` or `en` are all not NaN?
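If not, something like this just before the `model.decoder(...)` call should catch it. This is a quick sketch: the tensor names are the ones from the training loop, and `assert_finite` is just a throwaway helper, not part of the repo.

```python
import torch

def assert_finite(**tensors):
    """Raise if any named tensor contains NaN or Inf."""
    for name, t in tensors.items():
        if torch.isnan(t).any() or torch.isinf(t).any():
            raise RuntimeError(f"{name} contains NaN/Inf before the decoder call")

# e.g. inside train_second.py, right before model.decoder is called:
# assert_finite(en=en, F0_fake=F0_fake, N_fake=N_fake, s=s)
```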
> Have you checked whether `F0_fake`, `N_fake`, `s` or `en` are all not NaN?
None of the above are NaN.
The problem starts in the model.decoder() call, where y_rec_gt_pred becomes NaN even though the arguments are not NaN.
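Since the inputs look clean, one way to narrow it down is to find the first submodule inside the decoder whose output goes NaN. A rough sketch with forward hooks, assuming `model.decoder` is a plain `nn.Module` (`add_nan_hooks` is an illustrative helper, not part of the repo):

```python
import torch

def add_nan_hooks(module, tag="decoder"):
    """Register forward hooks that print every submodule whose output
    contains NaN/Inf; the earliest message is closest to the source."""
    handles = []

    def make_hook(name):
        def hook(mod, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outs:
                if torch.is_tensor(o) and (torch.isnan(o).any() or torch.isinf(o).any()):
                    print(f"NaN/Inf in {tag}.{name} ({mod.__class__.__name__})")
                    break
        return hook

    for name, sub in module.named_modules():
        handles.append(sub.register_forward_hook(make_hook(name)))
    return handles

# Usage (assumed):
# handles = add_nan_hooks(model.decoder)
# ...run one training step...
# for h in handles:
#     h.remove()
```

Running one step with `torch.autograd.set_detect_anomaly(True)` can also point at the operation that first produces the NaN.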
I know this is an old issue but after reading through the config, something seems off to me. This section specifically:
```yaml
pretrained_model: "C:\Users\user_\Desktop\styleTTS2_test_data\epoch_1st_00170.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters
```
You seem to be loading a 1st stage model (`pretrained_model: "C:\Users\user_\Desktop\styleTTS2_test_data\epoch_1st_00170.pth"`) but telling the script that it is actually a pretrained model from the 2nd stage and that you want to resume training on it (`second_stage_load_pretrained: true`). Also, I have never used `load_only_params: true`, so that might be another culprit, though perhaps it works with it as well.
If you're just starting a 2nd stage training, I'd change the config to:
pretrained_model: ""
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters
Also, since your 1st stage training only seems to have reached `epoch_1st_00170.pth` (so the final `first_stage.pth` file was perhaps never generated), you might also need to change the line `first_stage_path: "first_stage.pth"` to `first_stage_path: "C:\Users\user_\Desktop\styleTTS2_test_data\epoch_1st_00170.pth"`.
Hope this helps.