syang1993/gst-tacotron

Training a Multi-Speaker Model

sujithpadar opened this issue · 5 comments

Hi,
I had success using this model on single-speaker data, but I'm not sure how to scale it to a multi-speaker setting.
Is it just a matter of changing the data, or does the model need changes as well?
Has anyone tried this before?

@sujithpadar I modified this code to train a multi-speaker model with 4 speakers, and it works. To train a multi-speaker model, you need to add an extra speaker embedding and feed the speaker information into the model. See the paper "Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron".
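A minimal sketch of that conditioning, using numpy only (all sizes, names, and the random lookup table here are hypothetical illustrations, not the actual model or its hparams): look up a per-speaker vector, tile it over the encoder time steps, and concatenate it to the encoder output.

```python
import numpy as np

# Hypothetical sizes; the real model takes these from its hparams.
num_speakers = 4        # speakers in the training set
speaker_embed_dim = 16  # width of the learned speaker embedding
T, enc_dim = 50, 256    # encoder time steps and encoder output width

# A learnable lookup table: one row per speaker (random here for illustration;
# in the model this would be a trainable embedding variable).
speaker_table = np.random.randn(num_speakers, speaker_embed_dim)

def add_speaker_embedding(encoder_outputs, speaker_id):
    """Tile the speaker vector over time and concat it onto each encoder frame."""
    spk = speaker_table[speaker_id]                          # (speaker_embed_dim,)
    spk_tiled = np.tile(spk, (encoder_outputs.shape[0], 1))  # (T, speaker_embed_dim)
    return np.concatenate([encoder_outputs, spk_tiled], axis=-1)

encoder_outputs = np.random.randn(T, enc_dim)
conditioned = add_speaker_embedding(encoder_outputs, speaker_id=2)
print(conditioned.shape)  # (50, 272)
```

The decoder then attends over the widened frames, so every output step can see which speaker it is generating for.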

@syang1993 Thanks a lot for the quick response, really appreciate it.
From the paper I gather that:

  1. I need to concatenate the speaker embedding to the encoder output; the cost function is unchanged.
  2. I need to feed a speaker identifier to the model as well.
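For step 2, the speaker identifier just needs to be a small integer the embedding lookup can consume. A hedged sketch of one way to build that mapping (the helper name and speaker names are made up for illustration):

```python
# Hypothetical helper: map each speaker name in the corpus to a stable
# integer id that an embedding lookup can index with.
def build_speaker_map(speaker_names):
    return {name: i for i, name in enumerate(sorted(set(speaker_names)))}

speaker_map = build_speaker_map(["alice", "bob", "alice", "carol"])
print(speaker_map)  # {'alice': 0, 'bob': 1, 'carol': 2}
```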

I found such an implementation here:
https://github.com/keithito/tacotron/tree/multispeaker

Can I take this as a reference and make the necessary modifications?
Sorry to bug you; I'm new to deep learning.
Or better, if you have the implementation handy, can you share it?

Thanks!!!

@sujithpadar Hi, you are right. You only need to add a speaker embedding and concatenate it with the style embedding. So you only need to modify dataset.py and tacotron.py, plus small changes in train.py.
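On the data side, one simple approach (a sketch under assumed conventions, not the repo's actual format) is to append a speaker-id column to each line of the training metadata and parse it off before the existing fields:

```python
# Hypothetical metadata format: the usual pipe-separated columns plus a final
# speaker-id column, e.g. "mel-001.npy|linear-001.npy|220|text here|3".
def parse_meta_line(line):
    """Split a metadata line into the original fields and an integer speaker id."""
    *fields, speaker_id = line.strip().split('|')
    return fields, int(speaker_id)

fields, sid = parse_meta_line("mel-001.npy|linear-001.npy|220|hello world|3")
print(sid)  # 3
```

The feeder can then yield the speaker id alongside each batch so the model's embedding lookup receives it with the inputs.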

I'm sorry, but I'm currently interning at a company, so I can't upload any code related to my work.

@syang1993 Thanks a lot, I think I'll be able to manage that.

@sujithpadar, did you succeed in developing a multi-speaker version? If so, can you share it?