janvainer/speedyspeech

Issues for multi_speedyspeech?

Closed this issue · 5 comments

Hi, everybody. Has anyone tried a multi-speaker strategy for SpeedySpeech?
I have tried to adapt the model to Chinese, but the results are very poor; even the duration prediction module does not learn to predict well.
Have you encountered the same problems?

Hi @TaoTaoFu, I have not tried that yet, but I would be very interested. How did you approach this? Did you use speaker embeddings, or did you simply feed the model multiple speakers without any speaker information?

Yes, I used speaker embeddings from the GE2E method. I tried both adding and concatenating the speaker embedding with the attention keys and values, but it did not work. :(
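
Roughly what I tried, as a minimal PyTorch sketch (the module and parameter names are illustrative, not my actual code; `spk_dim` assumes a 256-dim GE2E d-vector and `d_model` the attention hidden size):

```python
import torch
import torch.nn as nn

class SpeakerConditionedKV(nn.Module):
    """Injects a speaker embedding into attention keys/values by add or concat."""

    def __init__(self, d_model=128, spk_dim=256, mode="add"):
        super().__init__()
        self.mode = mode
        if mode == "add":
            # project the speaker embedding to d_model, then add it to keys/values
            self.proj = nn.Linear(spk_dim, d_model)
        else:  # "concat"
            # concatenate along channels, then project back down to d_model
            self.proj = nn.Linear(d_model + spk_dim, d_model)

    def forward(self, kv, spk):
        # kv: (batch, time, d_model), spk: (batch, spk_dim)
        spk = spk.unsqueeze(1)  # (batch, 1, spk_dim), broadcast over time
        if self.mode == "add":
            return kv + self.proj(spk)
        spk = spk.expand(-1, kv.size(1), -1)
        return self.proj(torch.cat([kv, spk], dim=-1))
```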

One more question: why is the hidden dimension in the code so small? Have you tried increasing it?

I would need to see your code and how exactly the embeddings are used. Simply adding or concatenating the embeddings in the attention layer is not likely to work. The model should be biased by the embedding at each layer of the network. You could try using the embedding as a bias in the convolutional layers, or integrating it with the gated (WaveNet-like) blocks in a similar way to Deep Voice 3. The latter option is most likely to work, because the Deep Voice 3 architecture is quite similar to the duration predictor, so I would start there.
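
For example, something along these lines for the gated blocks (an untested sketch of Deep Voice 3 style speaker conditioning; the names and hyperparameters are illustrative, not taken from this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerGatedConvBlock(nn.Module):
    """Gated 1D conv block biased by a speaker embedding, Deep Voice 3 style.

    The speaker embedding is projected, squashed with softsign, and added
    inside the gate, so the layer is biased by the speaker identity.
    """

    def __init__(self, channels=128, spk_dim=256, kernel_size=5, dilation=1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.spk_proj = nn.Linear(spk_dim, channels)

    def forward(self, x, spk):
        # x: (batch, channels, time), spk: (batch, spk_dim)
        a, b = self.conv(x).chunk(2, dim=1)               # content and gate halves
        s = F.softsign(self.spk_proj(spk)).unsqueeze(-1)  # (batch, channels, 1)
        out = a * torch.sigmoid(b + s)                    # speaker-biased gate
        return (x + out) * (0.5 ** 0.5)                   # scaled residual, as in Deep Voice 3
```

Repeating this bias in every block of the duration predictor, rather than only at the attention inputs, is the point: the speaker information then reaches each layer directly.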

The hidden dimension in the attention is kept small to force the model to learn the attention alignment correctly. I experimented with larger dimensions, but there seems to be too much information in the input audio, and the model had no incentive to use the information from the input phonemes.