soobinseo/Transformer-TTS

Repeating words at the end of sentences

Opened this issue · 7 comments

When inputting long sentences, I found the model tended to repeat the ending words over and over again. I trained this model on the Blizzard Challenge 2011 database, and both the transformer and the postnet were trained for over 500k iterations. The loss looked like it had converged pretty well... Has anyone else come across this problem? Please give me some guidance on how to fix it.

Did you add "stop token" prediction? I also ran into this problem.

I did not add a stop token loss on the first try. However, even after adding stop token prediction, the model still tends to repeat ending words if the checkpoint is not carefully chosen. I also found it took more time to converge after adding the stop token prediction. Did you try adding stop token prediction, and how did that work out?

Here is my code for calculating the stop token loss:

```python
import torch as t
import torch.nn as nn

# Stop targets: 1 at padded frames (pos_mel == 0), 0 at real frames.
stop_tokens = t.abs(pos_mel.ne(0).type(t.float) - 1).cuda()

# Number of real (non-padded) frames per utterance.
pos_mask = t.sum(pos_mel.ne(0), 1)

# pos_weight is zero everywhere except the first padded frame (the stop
# frame), which gets weight 7, so padding beyond it contributes no loss.
pos_w_matrix = t.zeros(pos_mel.size())
for i in range(pos_w_matrix.size(0)):
    # min() guards the longest sequence in the batch, which has no padding.
    pos_w_matrix[i, min(int(pos_mask[i]), pos_w_matrix.size(1) - 1)] = 7.
pos_w_matrix = pos_w_matrix.cuda()

stop_tokens_loss = nn.BCEWithLogitsLoss(pos_weight=pos_w_matrix)(stop_preds, stop_tokens)
```
I used a separate optimizer to train the stop token linear projection parameters.
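Roughly, the two-optimizer setup looks like this (a minimal sketch; the module name `model.stop_linear` and the learning rates are assumptions, not values from the actual training script):

```python
import torch as t

# Assumed: `model.stop_linear` is the stop token projection layer;
# everything else is trained by the main optimizer.
stop_params = list(model.stop_linear.parameters())
stop_ids = {id(p) for p in stop_params}
main_params = [p for p in model.parameters() if id(p) not in stop_ids]

main_opt = t.optim.Adam(main_params, lr=1e-4)
stop_opt = t.optim.Adam(stop_params, lr=1e-3)

# Per training step: one backward pass through the combined loss, then step
# both optimizers. `mel_loss` and `stop_tokens_loss` come from the training loop.
loss = mel_loss + stop_tokens_loss
main_opt.zero_grad()
stop_opt.zero_grad()
loss.backward()
main_opt.step()
stop_opt.step()
```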

@TakoYuxin what batch size do you use in your hyperparameters, and how many steps does it take to get intelligible speech results?

The batch size is 16, and it took about 100k steps to get intelligible speech results without the stop token loss. However, several words may be repeated or dropped in the resulting sentences.

Thanks for your reply! Did you use multi-GPU training? I am training on a single GPU with 16 samples per step, and I can't get intelligible speech results.

That's weird. I was also training on a single GPU. Did you change any other hparams?

The stop tokens in the padding should be masked, right? Because the padded mels make predicting the pad stop tokens trivially easy. I think that's why it's not learning the stop token.
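One way to do this, as a minimal sketch (it assumes `pos_mel` is zero at padded frames, as in the snippet above, and puts the stop target on the last real frame):

```python
import torch as t
import torch.nn as nn

mask = pos_mel.ne(0).float()              # (B, T): 1 at real frames, 0 at padding
lengths = mask.sum(dim=1).long()          # number of real frames per utterance

stop_tokens = t.zeros_like(mask)
batch_idx = t.arange(mask.size(0), device=mask.device)
stop_tokens[batch_idx, lengths - 1] = 1.0  # stop token at the final real frame

per_frame = nn.BCEWithLogitsLoss(reduction='none')(stop_preds, stop_tokens)
stop_tokens_loss = (per_frame * mask).sum() / mask.sum()  # ignore padded frames
```

With `reduction='none'` the loss is kept per frame, so the padded positions can simply be zeroed out before averaging over the real frames.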