train_second.py model.decoder error (output tensor is nan)
The g_loss value in "train_second.py" is NaN.
Debugging shows that the output of the model.decoder() call is NaN (line 391, line 402).
There was no problem in train_first.py, so I don't know why this happens only in train_second.py.
If you know how to fix this error, please help me.
Thank you.
```yaml
log_dir: "C:\Users\user_\Desktop\styleTTS2_test_data"
first_stage_path: "first_stage.pth"
save_freq: 2
log_interval: 10
device: "cuda"
epochs_1st: 200 # number of epochs for first stage training (pre-training)
epochs_2nd: 100 # number of epochs for second stage training (joint training)
batch_size: 4
max_len: 200 # maximum number of frames
pretrained_model: "C:\Users\user_\Desktop\styleTTS2_test_data\epoch_1st_00170.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters

# config for decoder
decoder:
  type: 'istftnet' # either hifigan or istftnet
  resblock_kernel_sizes: [3,7,11]
  upsample_rates: [10, 6]
  upsample_initial_channel: 512
  resblock_dilation_sizes: [[1,3,5], [1,3,5], [1,3,5]]
  upsample_kernel_sizes: [20, 12]
  gen_istft_n_fft: 20
  gen_istft_hop_size: 5
```
Same issue
I have experienced this before in a few situations:
- the actual model parameters are not being loaded from the checkpoint (there is a naming mismatch involving a "module" prefix between stages 1 and 2, depending on whether you use distributed or non-distributed training; try setting strict loading to true and see which keys are reported; see the sketch after this list)
- multispeaker is set incorrectly
- certain batch sizes with mixed precision (try changing the batch size)
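For the first point, here is a minimal sketch of how you could load the stage-1 weights with strict key checking and strip a stray `module.` prefix. The `load_strict` helper and the `"net"` checkpoint key are my assumptions; adapt them to whatever `torch.load` actually returns for your checkpoint.

```python
import torch

def load_strict(model_dict, ckpt_path):
    """Hypothetical helper: load per-module state dicts from a checkpoint,
    strip any (Distributed)DataParallel "module." prefix, and load with
    strict=True so missing/unexpected keys raise immediately instead of
    silently leaving parts of the model randomly initialized."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Assumption: the checkpoint stores one state dict per sub-model under "net".
    params = ckpt.get("net", ckpt)
    for name, model in model_dict.items():
        state = params[name]
        # Drop the "module." prefix if the checkpoint was saved from a DDP-wrapped model.
        state = {k[len("module."):] if k.startswith("module.") else k: v
                 for k, v in state.items()}
        model.load_state_dict(state, strict=True)  # raises RuntimeError on key mismatch

# Usage (assumed): load_strict(model, config['pretrained_model'])
```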
Have you checked whether `F0_fake`, `N_fake`, `s` or `en` are all not NaN?
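If not, something like this just before the `model.decoder(...)` call should catch it. This is a quick sketch: the tensor names are the ones from the training loop, and `assert_finite` is just a throwaway helper, not part of the repo.

```python
import torch

def assert_finite(**tensors):
    """Raise if any named tensor contains NaN or Inf."""
    for name, t in tensors.items():
        if torch.isnan(t).any() or torch.isinf(t).any():
            raise RuntimeError(f"{name} contains NaN/Inf before the decoder call")

# e.g. inside train_second.py, right before model.decoder is called:
# assert_finite(en=en, F0_fake=F0_fake, N_fake=N_fake, s=s)
```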
> Have you checked whether `F0_fake`, `N_fake`, `s` or `en` are all not NaN?
None of the above are NaN.
The problem starts in the model.decoder() call, where y_rec_gt_pred becomes NaN even though the arguments are not NaN.
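Since the inputs look clean, one way to narrow it down is to find the first submodule inside the decoder whose output goes NaN. A rough sketch with forward hooks, assuming `model.decoder` is a plain `nn.Module` (`add_nan_hooks` is an illustrative helper, not part of the repo):

```python
import torch

def add_nan_hooks(module, tag="decoder"):
    """Register forward hooks that print every submodule whose output
    contains NaN/Inf; the earliest message is closest to the source."""
    handles = []

    def make_hook(name):
        def hook(mod, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for o in outs:
                if torch.is_tensor(o) and (torch.isnan(o).any() or torch.isinf(o).any()):
                    print(f"NaN/Inf in {tag}.{name} ({mod.__class__.__name__})")
                    break
        return hook

    for name, sub in module.named_modules():
        handles.append(sub.register_forward_hook(make_hook(name)))
    return handles

# Usage (assumed):
# handles = add_nan_hooks(model.decoder)
# ...run one training step...
# for h in handles:
#     h.remove()
```

Running one step with `torch.autograd.set_detect_anomaly(True)` can also point at the operation that first produces the NaN.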
I know this is an old issue but after reading through the config, something seems off to me. This section specifically:
```yaml
pretrained_model: "C:\Users\user_\Desktop\styleTTS2_test_data\epoch_1st_00170.pth"
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: true # set to true if do not want to load epoch numbers and optimizer parameters
```
You seem to be loading a 1st stage model (`pretrained_model: "C:\Users\user_\Desktop\styleTTS2_test_data\epoch_1st_00170.pth"`) but telling the script that it is actually a pretrained model from the 2nd stage and that you want to resume training on it (`second_stage_load_pretrained: true`). Also, I have never used `load_only_params: true`, so that might be another culprit, though perhaps it works with it as well.
If you're just starting a 2nd stage training, I'd change the config to:
pretrained_model: ""
second_stage_load_pretrained: true # set to true if the pre-trained model is for 2nd stage
load_only_params: false # set to true if do not want to load epoch numbers and optimizer parameters
Also, since your 1st stage training only seems to have reached `epoch_1st_00170.pth` (so the final `first_stage.pth` file was perhaps never generated), you might also need to change the line `first_stage_path: "first_stage.pth"` to `first_stage_path: "C:\Users\user_\Desktop\styleTTS2_test_data\epoch_1st_00170.pth"`.
Hope this helps.