yl4579/StyleTTS2

LJSpeech or LibriTTS for fine-tuning a single-speaker voice? Or train from scratch with little data?

Sweetapocalyps3 opened this issue · 4 comments

Hi everyone,

I'm wondering whether the LJSpeech or the LibriTTS model is the better candidate for fine-tuning on a single person's voice.
I've seen that there is a multispeaker boolean field in the configuration, which in my case should be set to false, but I don't know whether this implies I have to use LJSpeech, since LibriTTS is a multi-speaker dataset.

Or would it be better to train the model from scratch? I'm considering it, but I suspect I have too few samples (126 files of clean audio, almost 19 minutes in total).

Thank you in advance.

LibriTTS is by far the better choice: the model has seen multiple speakers and can adapt far better to a small single-speaker dataset.

You can leave all of the settings in config_ft.yml the same, changing only the dataset, and then the batch size and window size depending on your hardware. Multi-speaker should be kept on true; just make sure that in your dataset metafiles the speaker_id is set to the same id for every file.
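For the "same speaker_id for every file" part, here is a minimal sketch of how the metafile lines could be normalized. It assumes a pipe-separated layout of `path|text|speaker_id` with no `|` inside the text; check that against your own list files before relying on it. The function name `normalize_speaker_id` and the default id `"0"` are made up for illustration and are not part of the repo.

```python
from typing import Iterable, List


def normalize_speaker_id(
    lines: Iterable[str], speaker_id: str = "0", sep: str = "|"
) -> List[str]:
    """Return metafile lines whose speaker field is forced to one id.

    Assumed layout (not confirmed here): path|text|speaker_id.
    Lines with only path|text get the speaker field appended.
    """
    fixed = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(sep)
        if len(fields) < 2:
            raise ValueError(f"unexpected metafile line: {line!r}")
        if len(fields) == 2:
            fields.append(speaker_id)  # no speaker field yet
        else:
            fields[-1] = speaker_id  # overwrite the existing speaker field
        fixed.append(sep.join(fields))
    return fixed


if __name__ == "__main__":
    sample = ["a.wav|hello there|3\n", "b.wav|second clip|7\n"]
    for out_line in normalize_speaker_id(sample):
        print(out_line)
```

You would read your train and validation lists, pass the lines through this, and write the result back, so that every clip ends up with the same speaker id.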

Training the model from scratch with 19 minutes of data will most likely yield bad results, although I haven't tried it myself.

Helpful details on fine-tuning: #81