Does the model take reference audio for TTS like Coqui TTS (zero-shot)?

Question

Closed this issue 2 years ago · 1 comment

I saw that you mention Coqui TTS as the code reference in README.md. Does your model take a reference audio WAV file as input along with the text and produce speech in that voice?

Answer 1 · 2023-02-16T15:36:52.000Z

No. We trained the speaker embeddings from one-hot encoded speaker IDs in the dataset, so the model won't work directly for unseen speakers. However, you could try to make it work with unseen reference audio by adding a neural network that takes audio as input and approximates the output our model's speaker embedding layer produces for the corresponding speaker ID.
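
To make the idea concrete, here is a rough sketch in PyTorch, not code from this repo. Everything in it is an assumption for illustration: the sizes, the `ReferenceEncoder` class, and the random `speaker_embedding` table that stands in for the trained (frozen) embedding layer. The synthetic "mel spectrograms" are just noise plus a per-speaker offset so the example runs on its own; in practice you would use mels from the training utterances.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for this sketch.
NUM_SPEAKERS, EMB_DIM, N_MELS, FRAMES = 4, 16, 80, 50

# Stand-in for the trained TTS model's speaker embedding layer (kept frozen).
speaker_embedding = nn.Embedding(NUM_SPEAKERS, EMB_DIM)
speaker_embedding.requires_grad_(False)


class ReferenceEncoder(nn.Module):
    """Maps a mel spectrogram (batch, frames, n_mels) to a speaker embedding."""

    def __init__(self, n_mels: int, emb_dim: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(mel)       # h: (num_layers, batch, hidden)
        return self.proj(h[-1])    # (batch, emb_dim)


# Synthetic training data: noise plus a per-speaker offset (assumption).
speaker_ids = torch.randint(0, NUM_SPEAKERS, (32,))
speaker_offsets = torch.randn(NUM_SPEAKERS, N_MELS)
mels = 0.1 * torch.randn(32, FRAMES, N_MELS) + speaker_offsets[speaker_ids][:, None, :]

encoder = ReferenceEncoder(N_MELS, EMB_DIM)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# Train the encoder to regress onto the frozen embedding of each utterance's speaker.
losses = []
for _ in range(100):
    target = speaker_embedding(speaker_ids)   # what the TTS model expects
    loss = loss_fn(encoder(mels), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# At inference, the predicted vector would replace the embedding lookup.
with torch.no_grad():
    reference_mel = torch.randn(1, FRAMES, N_MELS)   # mel of the reference audio
    predicted_embedding = encoder(reference_mel)      # (1, EMB_DIM)
```

Whether this generalizes to truly unseen voices depends on how many speakers the original embedding table covers; with only a few training speakers the encoder will most likely map new voices to something close to one of the known ones.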