heatz123/naturalspeech

Quality comparison to the original implementation

Thank you for your great work! I really hope you get approval for publishing the models.
In your notes you write:

  • This implementation does not include pre-training of phonemes using a large-scale text corpus from the news-crawl dataset.

Does this mean the quality here will be worse?

Hi @dreamflasher, thank you for your attention.
Unfortunately, I might need to retrain the model on my own GPU before I can publish it, which would take some time.

  • This implementation does not include pre-training of phonemes using a large-scale text corpus from the news-crawl dataset.

Yes, you can expect a small quality drop (-0.09, as stated in the paper), but the quality should still be better than the baseline (VITS) thanks to the other changes.