yl4579/StyleTTS

Any-to-any and emotion examples

Closed this issue · 6 comments

I'm trying to replicate your results. How did you create your any-to-any examples? Did you take the text from the "source" audio and use the "reference" audio as a zero-shot reference?

Similarly, in the emotion examples, were those also zero-shot where you just used the file from ESD as a reference to create the reference embedding?

Thanks!

yl4579 commented

Any-to-any was voice conversion. The idea is that you provide the text and use the text aligner to get the attention alignment; then, given reference audio of the target speaker, you use the alignment and text to "resynthesize" the speech in the target speaker's voice. This idea was carried over to the StyleTTS-VC project: https://github.com/yl4579/StyleTTS-VC
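
A rough sketch of that pipeline is below. The module names (`text_aligner`, `style_encoder`, `decoder`) and their signatures are assumptions for illustration only, not the actual StyleTTS API.

```python
import torch

# Hypothetical sketch of the any-to-any conversion described above.
# `text_aligner`, `style_encoder`, and `decoder` stand in for the
# corresponding StyleTTS modules; names and signatures are assumed.

@torch.no_grad()
def any_to_any(text_tokens, source_mel, target_ref_mel,
               text_aligner, style_encoder, decoder):
    """Resynthesize `text_tokens` in the target speaker's voice.

    text_tokens:    LongTensor  [1, T_text]        token ids of the text
    source_mel:     FloatTensor [1, n_mels, T_src] mel of the source speech
    target_ref_mel: FloatTensor [1, n_mels, T_ref] mel of the target speaker
    """
    # 1. Get the attention alignment between the text and the source audio,
    #    which fixes the durations of the utterance.
    alignment = text_aligner(source_mel, text_tokens)   # [1, T_text, T_src]

    # 2. Encode the target speaker's reference audio into a style vector.
    style = style_encoder(target_ref_mel)                # [1, style_dim]

    # 3. Decode from the text, the source-derived alignment, and the target
    #    style to "resynthesize" the speech in the target voice.
    converted_mel = decoder(text_tokens, alignment, style)
    return converted_mel
```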

The emotion examples are not zero-shot but use seen speakers from ESD. Unfortunately, it won't be this good if you do zero-shot.

Can you tell me about the zero-shot examples? Were the texts already in the training dataset?

yl4579 commented

What do you mean by text?

I mean the text that produces the speech: was it part of the training set (with the accompanying audio), or was it new text that wasn't seen in training?

For example, "The difference in the rainbow depends considerably upon the size of the drops, and the width of the colored band increases as the size of the drops increases." has a "GT" audio. Were the text and its GT audio part of the training dataset or not?

yl4579 commented

Both the texts and the speakers were unseen during training, if that is what you mean. The GT audio is just a reference so you can compare the model's output against the ground truth.