unilight/seq2seq-vc

Repetitive generation and low probability problem in arctic/vc1 model training

leelee724 opened this issue · 4 comments

Hello, I'm having some problems with arctic/vc1/. I used the code and parameters you provided for training, with the same dataset and no parameter changes except reducing the batch_size (100 -> 64). However, some of the converted wav files suffer from repetitive generation, i.e. the model does not stop generating the sequence at the right frame but keeps going until the maximum sequence length. For those problematic wav files, the probs plot shows a stop probability that is almost 0, or at least below the threshold (0.5).
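For context, this is roughly how I understand the stop criterion I'm referring to, in ESPnet-style autoregressive decoding (just a sketch; decoder_step, the 80-dim mel size, and the other names are illustrative, not the exact identifiers in this repo):

```python
import torch

def greedy_decode(decoder_step, encoder_states, max_len, threshold=0.5):
    """Generate mel frames until the predicted stop probability exceeds the threshold."""
    outputs = []
    prev_frame = torch.zeros(1, 80)  # initial "go" frame (80-dim mel is an assumption)
    for _ in range(max_len):
        frame, stop_logit = decoder_step(prev_frame, encoder_states)
        outputs.append(frame)
        # sigmoid(stop_logit) is the value shown in the probs plot; if it never
        # rises above the threshold, decoding runs all the way to max_len and
        # the tail of the output tends to repeat earlier content.
        if torch.sigmoid(stop_logit).item() > threshold:
            break
        prev_frame = frame
    return torch.cat(outputs, dim=0)
```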
This happens whether I use TTS_aept to fine-tune the VC model or train the VC model from scratch.
Please tell me if I am doing something wrong or if this is normal. Thank you very much!

By the way, the BCEWithLogitsLoss quickly approaches 0 during training: it is around 0.1 at about 190 steps and then stays around 0.00xx until the end of training.
BCE_Loss=0.0000 and L1_Loss=0.2636 at 50,000 steps.
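For reference, my understanding of the stop-token loss that produces these BCE numbers is something like the following (a minimal sketch; the shapes and the pos_weight value are made up for illustration and are not taken from this repo's config):

```python
import torch

T = 200                          # number of decoder frames in one utterance
stop_logits = torch.randn(T)     # predicted stop logits, one per frame
stop_labels = torch.zeros(T)     # ground truth: 1 only at the last frame
stop_labels[-1] = 1.0

# pos_weight up-weights the single positive (stop) frame; without it the loss
# can be driven close to 0 just by predicting "don't stop" everywhere.
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))
loss = criterion(stop_logits, stop_labels)
print(loss.item())
```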
Thank you!

Hi @leelee724 did you calculate the CER/WER? What were the numbers?

I used TTS_aept to fine-tune the VC model. On the eval set, CER=10.4 and WER=15.2, but in your paper they are CER=2.4 and WER=6.3. There's a noticeable gap between the two, so I'm wondering whether I'm doing something wrong.

For example, there is a sentence with the following evaluation result:
arctic_b0462 9.62 18.75 0.48 1.36 75.8 83.3 ONE GUESS WILL DO ERNEST RETORTED WORTED ONE GUESS WILL DO | ONE GUESS WILL DO ERNEST RETORTED

For this sentence, CER=75.8 and WER=83.3.
The content of the wav file generated by the model is "ONE GUESS WILL DO ERNEST RETORTED WORTED ONE GUESS WILL DO"
The ground truth is "ONE GUESS WILL DO ERNEST RETORTED"

The model repeats "WORTED ONE GUESS WILL DO" at the end of the sequence, but I don't know why.
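To sanity-check the numbers, I recomputed WER/CER for this utterance with a plain Levenshtein distance (my own quick script, not the recipe's scoring tool, which may normalize text differently):

```python
def edit_distance(ref, hyp):
    """Standard Levenshtein distance between two token sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

ref = "ONE GUESS WILL DO ERNEST RETORTED"
hyp = "ONE GUESS WILL DO ERNEST RETORTED WORTED ONE GUESS WILL DO"

wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())
cer = edit_distance(list(ref), list(hyp)) / len(ref)
print(f"WER = {wer:.1%}, CER = {cer:.1%}")
# Prints roughly WER = 83.3%, CER = 75.8% -- the same numbers as above, and
# all of the errors are insertions coming from the repeated tail.
```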

Sorry for the trouble, and thank you for your reply!

Hi @leelee724 which config file did you use? Was num_train set to 932? And did you find this problem in a lot of samples?

Just to clarify, there are some modifications in this repo compared to my original code that was used to generate the results in the journal paper. The results I can get with this repo are (with TTS aept):

MCD/CER/WER
num_train=932 6.43/5.8/9.7
num_train=80 7.07/11.1/15.4

Hi, I'm using vtn.tts_pt.v1.yaml with num_train=932. Out of 100 samples in the eval set, about 1/4 of them have the repeated-generation problem (some repeat only one word, some repeat multiple words).

My current situation is that in the wrong sentences the stop probability stays very low (as seen in the probs plot), so the model never stops generating the sequence, which leads to repeated generation of words.

Currently, I have changed maxlenratio from the default 6.0 to 4.0 (only maxlenratio and batch_size are changed) to limit the maximum length of the generated sequences. With a shorter maximum length, less of the repeated material survives in the output, so CER and WER drop, although the repeated-generation problem itself is still there.
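For clarity, this is how I understand the maxlenratio cap (a sketch; the exact formula in the decoding code may differ, e.g. in rounding or in which length it scales):

```python
def max_decode_len(num_source_frames: int, maxlenratio: float) -> int:
    """Upper bound on the number of generated frames (hypothetical formula)."""
    return int(maxlenratio * num_source_frames)

# Lowering maxlenratio from 6.0 to 4.0 cuts off a runaway decode sooner,
# e.g. a 300-frame source utterance is capped at 1200 instead of 1800 frames.
print(max_decode_len(300, 6.0), max_decode_len(300, 4.0))  # 1800 1200
```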

Now, I have the following questions:

  1. On the eval set, is it normal to see repeated generation / a low probability of the model stopping generation (regardless of how many samples are affected)?
  2. Are bce_pos_weight and maxlenratio calculated in some way, or are they set based on your experience?
  3. Following up on question 2, if num_train or the dataset changes, do these two parameters need to be changed accordingly?

Sorry for the many questions, but thank you very much for your reply!