heatz123/naturalspeech

About sample quality


Hi @heatz123

Hope you are doing well.
I am interested to know how the overall sample quality is, both with and without the soft-dtw loss. Is it as good as the paper claims, or as the samples shared by the authors?
How does the quality compare to VITS?

Thanks

Hi @rishikksh20,

Thank you for reaching out.

To answer your question, I am unable to give a definite answer at this time, as I haven't trained this model to the end.

From my training experience of 1.5k epochs (~150k iterations) without the soft-dtw loss, I would say that the overall sample quality is satisfactory, with the majority of samples sounding natural. However, the level of "no statistically significant difference from human recordings" does not seem to have been reached after 1.5k epochs, which is only 1/10th of the original paper's training of 15k epochs.
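To put those numbers in perspective, here is a rough back-of-the-envelope sketch using only the figures quoted above. It assumes the number of iterations per epoch stays constant (same dataset and batch size), which is an assumption for illustration and not something measured; the variable names are made up for this example.

```python
# Figures quoted in this thread; both are approximate.
epochs_done = 1_500          # epochs trained so far
iterations_done = 150_000    # ~150k iterations
paper_epochs = 15_000        # training length reported for the original paper

# Assumption: iterations per epoch is constant across the whole run.
iterations_per_epoch = iterations_done // epochs_done
fraction_of_paper = epochs_done / paper_epochs
projected_iterations = iterations_per_epoch * paper_epochs

print(f"~{iterations_per_epoch} iterations per epoch")
print(f"{fraction_of_paper:.0%} of the paper's schedule completed")
print(f"~{projected_iterations:,} iterations for the full schedule")
```

Under that assumption, matching the paper's schedule would take roughly ten times the training done so far.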

I am working on sharing some demos or a pretrained model in the near future (possibly in weeks), so that you can evaluate the sample quality for yourself. Additionally, I'm considering further training to reproduce the results of the paper, once I have access to more resources.

I hope this helps. Thanks.