loss issues encountered in fine-tuning the model
wangtao201919 opened this issue · 6 comments
The values of the training loss curves do look a little strange; the generator loss probably shouldn't be increasing so much. But it is a little hard to say, since the training loss dynamics of GANs can be tricky to interpret. The validation mel-spectrogram error also should ideally be decreasing, but again it is a bit hard to tell. My advice is to listen to the samples generated every 20k steps or so and assess whether they sound better or not.
To give some comparison, here is what the training curves look like for our prematched vocoder trained on librispeech for the first 1M steps:
Aside from the generator loss, our plots and yours look fairly similar.
I hope that helps a bit!
Thank you very much for your response and suggestions. I have tested the fine-tuned model, and the improvement is significant. However, I have found that its generalization in cross-lingual conversion is not stable enough, just as you mentioned in your paper: "how far away can the reference utterances be from the training distribution?" I suspect this might be because the distance between features of the target language and the reference language cannot be measured as accurately as within the same language.
Ahh I see. Yeah, for languages that are very different from English, one might need to fine-tune the WavLM encoder on that language as well, to allow it to better represent that language in the feature space. Without this, you are probably right that distance comparisons between features of different languages are not as reliable as within the same language.
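To make the distance-comparison point concrete, here is a minimal sketch of the kNN feature-matching step this discussion is about: each source frame is replaced by the average of its k nearest target-speaker frames under cosine similarity. The function name, shapes, and value of k are illustrative, not the repo's actual API.

```python
import numpy as np

def knn_match(query, pool, k=4):
    """Replace each query frame with the mean of its k nearest pool frames.

    query: (T, D) feature frames of the source utterance (e.g. WavLM features)
    pool:  (N, D) feature frames of the reference/target speaker (matching set)
    """
    # Normalize rows so a dot product gives cosine similarity.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    p = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sim = q @ p.T                          # (T, N) cosine similarities
    idx = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar frames
    return pool[idx].mean(axis=1)          # (T, D) converted feature sequence
```

If the pool comes from a language very different from the query's, the nearest neighbours found this way can be poor phonetic matches, which is one plausible cause of the unstable cross-lingual results described above.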
Hi @egorsmkv, unfortunately not that we know of. There might be some useful resources for that in the original Microsoft repo, but I don't think they ever open-sourced their training code.
Hi @wangtao201919, did you fine-tune WavLM for Chinese? If not, did you obtain good results by fine-tuning just the vocoder?