yl4579/StyleTTS2

Multi-lingual training

Opened this issue · 30 comments

Thanks for the wonderful work, which gives good expressive TTS for English speakers. I am planning an Indian multilingual TTS, and I have a few questions.

  1. Do we need to change only the data and the PL-BERT model, or are other changes required?
  2. Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English?
  3. For multiple languages, do we need to add data in each language to OOD_data: "Data/OOD_texts.txt"?
  4. Do we need to add a language ID during data preparation, similar to adding a speaker ID in train_list.txt/val_list.txt?

You have to train the PL-BERT model on a text dataset in the specific language you want. A text dataset of more than 30 MB is sufficient, though you can use a larger one. Then use that trained PL-BERT model in StyleTTS2. Since you want to work with multilingual data, you also need a phonemizer and tokenizer that support each language. And you have to train StyleTTS2 (training stage 1 and stage 2) with the dataset for that language (train_list.txt, val_list.txt, and OOD_texts.txt).

@SandyPanda-MLDL Thanks for the quick reply and for answering the first question. I understand the part about training the PL-BERT model with a multilingual dataset.

How about the other three questions?
2. Can we use this ASR model (ASR_path: "Utils/ASR/epoch_00080.pth") for languages other than English?
3. For multiple languages, do we need to add data in each language to OOD_data: "Data/OOD_texts.txt"?
4. Do we need to add a language ID during data preparation, similar to adding a speaker ID in train_list.txt/val_list.txt?

According to the README, the ASR model performs well in other languages. I tested it and it does work fine. However, when I trained my own ASR model, StyleTTS improved dramatically. After this, I decided to train all models with my own data and achieved the same quality that the model delivers in English.

@traderpedroso Thanks for the reply.

  1. Did you fine-tune the ASR model (https://github.com/yl4579/AuxiliaryASR) on top of the existing ASR model, or train it from scratch with multiple languages?

  2. Did you also try to train the PL-BERT model with multiple languages? If yes, can we combine multiple languages, and do we need an equal amount of training data for each language?

I used the PL-BERT recommended in the multilingual repository https://huggingface.co/papercup-ai/multilingual-pl-bert and it worked perfectly for ASR. I tested it with fine-tuning and also tried training from scratch; both approaches gave me the same result. To be clear, the ASR that I trained from scratch was for a single language.

From my experience training StyleTTS 2, it is only worthwhile because inference is very fast and consumes little VRAM; the training cost makes it somewhat unfeasible. Besides, you can only train the second stage with a single GPU. To be clear, I didn't train the model from scratch, which would be even more expensive, but I can guarantee that the quality is sensational. Another advantage of StyleTTS 2 is that it doesn't hallucinate; the generated audio is extremely reliable, especially for real-time streaming applications that don't need monitoring. However, in terms of cost versus benefit, I personally prefer Tortoise for the final outcome.

@traderpedroso Thanks, I understand the AuxiliaryASR part. I will train it from scratch if the quality is bad.

  1. My use case is multilingual TTS for Indian languages, but Indian languages are not part of PL-BERT (https://huggingface.co/papercup-ai/multilingual-pl-bert). Do you think we can still use multilingual-pl-bert for unseen languages?
  2. For multiple languages, do we need to add data in each language to OOD_data: "Data/OOD_texts.txt"?
  3. Do we need to add a language ID during data preparation for the multilingual use case, similar to adding a speaker ID in train_list.txt/val_list.txt? Otherwise, how will the model know which language to select at inference time?

Ensure that the speaker IDs are numbers. I personally used large numbers for the IDs, such as 3000, 3001, etc. If your language is not listed, you need to fine-tune multilingual-pl-bert with it. You do not need to add a language ID. Keep the data in the same format as the example in the Data folder.
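As a rough illustration of the numeric speaker ID advice, here is a minimal sketch that assigns IDs starting at 3000 and writes `wav|text|id` lines. The helper names, file paths, and speaker names are hypothetical, and the three-field layout is assumed from the examples in the Data folder.

```python
from typing import Dict, Iterable, List, Tuple


def assign_speaker_ids(speakers: Iterable[str], start: int = 3000) -> Dict[str, int]:
    """Map each distinct speaker name to a numeric ID, in sorted name order."""
    return {name: start + i for i, name in enumerate(sorted(set(speakers)))}


def build_list_lines(rows: List[Tuple[str, str, str]], start: int = 3000) -> List[str]:
    """rows holds (wav_path, text, speaker_name); returns 'wav|text|id' lines."""
    ids = assign_speaker_ids((speaker for _, _, speaker in rows), start)
    return [f"{wav}|{text}|{ids[speaker]}" for wav, text, speaker in rows]


# Hypothetical entries, only to show the shape of the output.
rows = [
    ("wavs/a_0001.wav", "first sentence", "alice"),
    ("wavs/b_0001.wav", "second sentence", "bruno"),
    ("wavs/a_0002.wav", "third sentence", "alice"),
]
lines = build_list_lines(rows)
print("\n".join(lines))
```

Sorting the speaker names keeps the mapping stable between runs, so the same speaker gets the same ID in both the training and validation lists.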

I added data in the language I trained on to Data/OOD_texts.txt, but honestly, I believe it makes no difference: for the first 20 epochs I trained with the original Data/OOD_texts.txt, and the model was already generating quality audio.

At inference time, you need a dropdown list to select the language for your G2P (in this case, phonemizer), or a library that detects the language and switches the language code in the phonemizer, for example en-us, it, fr, etc.
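The dropdown idea could look roughly like the sketch below: a lookup from a UI label to an eSpeak language code, plus a factory for the phonemizer backend. The labels and function names are hypothetical, and which codes are valid depends on the installed eSpeak version, so treat the table as an assumption.

```python
from typing import Dict

# Hypothetical dropdown labels mapped to eSpeak language codes.
ESPEAK_CODES: Dict[str, str] = {
    "English (US)": "en-us",
    "Italian": "it",
    "French": "fr",
    "Brazilian Portuguese": "pt-br",
}


def resolve_language(choice: str) -> str:
    """Return the eSpeak code for a dropdown label, or fail loudly."""
    try:
        return ESPEAK_CODES[choice]
    except KeyError:
        raise ValueError(f"unsupported language: {choice!r}") from None


def make_phonemizer(choice: str):
    """Build a phonemizer backend for the selected language."""
    # Imported lazily so the lookup works without phonemizer/eSpeak installed.
    from phonemizer.backend import EspeakBackend

    return EspeakBackend(
        language=resolve_language(choice),
        preserve_punctuation=True,
        with_stress=True,
    )


print(resolve_language("Italian"))
```

An automatic language detector could replace the dropdown by feeding its result into `resolve_language`; the rest stays the same.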

@traderpedroso Thanks for answering all my questions in such detail. I will try to build a multilingual TTS model and will report back if it is successful.

@traderpedroso How many hours of audio data did you use for training?

6 hours of audio at 24,000 Hz, batch size 4, on an A100 80GB, running for 10 hours over 10 epochs. The first time, I trained with len 300 for 30 epochs and the quality was bad. After that, I fine-tuned the same model for 10 epochs with len 800, and after the second epoch it was generating perfect audio.

That is much less audio data than I expected. For the len that you changed, do you mean the max_len in the config?

Yes, max_len of 800, but I found a more efficient way when training my fourth model. I followed this approach: first, I trained the model with audio from 2 seconds to a maximum of 4 seconds, with max_len of 300. Of course, the final quality wasn't impressive, but it trained without problems for 50 epochs in less than 2 hours. Then I fine-tuned for 5 epochs with audio of the "same length" of 8 seconds. The model turned out perfect, with zero noise at the end and smooth pronunciation. It became very humanized and much better, and I spent fewer resources on the training. The 8-second audios can be from various speakers with a maximum of 80 seconds each. In my case, I trained with 50 speakers, and the fine-tuning used only a one-hour dataset with max_len 800.
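To make the two-stage recipe concrete, here is a minimal sketch that splits a clip list by duration: 2 to 4 seconds for the first run and roughly 8 seconds for the fine-tuning run. The clip names, the `select_by_duration` helper, and the half-second tolerance around 8 seconds are assumptions for illustration.

```python
from typing import List, Tuple

Clip = Tuple[str, float]  # (wav_path, duration_in_seconds)


def select_by_duration(clips: List[Clip], min_s: float, max_s: float) -> List[Clip]:
    """Keep clips whose duration falls inside [min_s, max_s]."""
    return [clip for clip in clips if min_s <= clip[1] <= max_s]


# Hypothetical clip inventory; durations would normally be read from the files.
clips = [
    ("a.wav", 1.2),
    ("b.wav", 2.5),
    ("c.wav", 3.9),
    ("d.wav", 8.0),
    ("e.wav", 8.1),
    ("f.wav", 15.0),
]

stage_one = select_by_duration(clips, 2.0, 4.0)   # short clips for the fast first run
fine_tune = select_by_duration(clips, 7.5, 8.5)   # "same length" clips near 8 s (tolerance assumed)

print([path for path, _ in stage_one])
print([path for path, _ in fine_tune])
```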

@traderpedroso Thanks for your insights.

  1. I was able to successfully fine-tune the model with 4 hours of data for a single Indic speaker. But at the end of the audio, I hear some noisy click sounds. Any pointers on how to solve this issue?
  2. I was able to use max_len=400 for initial training and max_len=300 for joint training. If I increase max_len, I get OOM. Did you use max_len=800 for joint training as well?
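For a sense of scale, the max_len values in this thread can be converted to seconds with simple arithmetic. This sketch assumes that max_len counts mel-spectrogram frames and that the hop length is 300 samples; the 24,000 Hz sample rate is the one mentioned above, while the other two points are assumptions that should be checked against your own config.

```python
def max_len_to_seconds(max_len: int, hop_length: int = 300, sample_rate: int = 24000) -> float:
    """Convert a frame count to seconds: frames * hop_length / sample_rate."""
    return max_len * hop_length / sample_rate


for frames in (300, 400, 800):
    print(frames, "frames ->", max_len_to_seconds(frames), "s")
```

Under these assumptions, 300 frames is 3.75 s, 400 frames is 5 s, and 800 frames is 10 s, which lines up with the 2 to 4 second and 8 second clip lengths described earlier. Memory use grows with the segment length, so the OOM at larger max_len is expected behavior rather than a bug.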

Hey! Were you able to build the PL-BERT model for Hindi? I seem to be in the same situation as you.

You need to add silence padding to your audio before training. I added 500 ms to the beginning and end of each audio file. Then, during inference, I implemented a workaround with the following code:

import numpy as np
import scipy.io.wavfile

# normalizer, split_sentence, styletts2importable, and voices come from the
# author's own inference setup and are not shown in this thread.


def trim_audio(audio_np_array, sample_rate=24000, trim_ms=350):
    """Cut trim_ms from both ends; leave clips too short to trim untouched."""
    trim_samples = int(trim_ms * sample_rate / 1000)
    if len(audio_np_array) > 2 * trim_samples:
        return audio_np_array[trim_samples:-trim_samples]
    return audio_np_array


def tts(input: str, voice="Bia", output_sample_rate=24000, alpha=0.7, beta=0.7, diffusion_steps=5, embedding_scale=2, output_wav_file=None):
    text = normalizer(input)
    if text.strip() == "":
        raise ValueError("insert some text")
    # The limit is checked against the character count of the normalized text.
    if len(text) > 50000:
        raise ValueError("max 50,000 characters")

    # Synthesize sentence by sentence, trimming the padded silence from each piece.
    texts = split_sentence(text)
    audios = []
    for t in texts:
        audio = styletts2importable.inference(
            t,
            voices[voice],
            alpha=alpha,
            beta=beta,
            diffusion_steps=diffusion_steps,
            embedding_scale=embedding_scale,
        )
        audios.append(trim_audio(audio))
    output_audio = np.concatenate(audios)
    if output_wav_file:
        scipy.io.wavfile.write(output_wav_file, rate=output_sample_rate, data=output_audio)
    return output_sample_rate, output_audio
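The training-side half of this advice, adding 500 ms of silence to both ends of each clip, is not shown in the code above. A minimal sketch of it, assuming mono audio already loaded as a 1-D NumPy array at 24,000 Hz, might look like this (the `pad_with_silence` name is hypothetical):

```python
import numpy as np


def pad_with_silence(audio: np.ndarray, sample_rate: int = 24000, pad_ms: int = 500) -> np.ndarray:
    """Prepend and append pad_ms of digital silence to a mono clip."""
    pad = np.zeros(int(pad_ms * sample_rate / 1000), dtype=audio.dtype)
    return np.concatenate([pad, audio, pad])


clip = np.ones(24000, dtype=np.float32)  # one second of dummy audio
padded = pad_with_silence(clip)
print(len(clip), len(padded))
```

Padding with exact zeros is the simplest choice; if the recordings have audible room noise, a hard jump to digital silence may itself be noticeable, so a short fade could be worth trying.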

@traderpedroso Thanks for the detailed answer and code; this is very helpful.

@tanishbajaj101 I have trained a Hindi StyleTTS2 model with the existing English BERT model, and it seems to be working fine without any issues. So I have not yet explored a Hindi PL-BERT model.

I'm building a dataset creator for WebUI using models to recognize speakers, segment audio, and detect silence for cuts and padding. Using Whisper alone for cutting audio isn't ideal, it's hard to get good quality cuts. Doing it manually is a lot of work! I found some models on Hugging Face that might help, so I'm hoping to develop something that makes fine-tuning easier for everyone. If I get something working well, I'll share it here. Thanks, and see you later!

If I want to train a model for a new language, there are a few steps I need to follow:

  1. Train an ASR model for the new language and use it in TTS training;
  2. Train a PL-BERT model for the new language and use it in TTS training;
  3. Prepare audio-text-phoneme data;
  4. Train the TTS model.

Could you please tell me whether there are any other steps I need to do?

@traderpedroso thank you very much for all your detailed work, shortcuts and workarounds. I'm trying to train a good English model from scratch but since I'm using around 9000 WAV files with lengths from 1s to 20s, it's actually quite costly (although even at 20 epochs of stage 2, I'm already getting some good results).

If you'd like to put your ideas into a single place, I'd be more than willing to compile a Jupyter notebook that would incorporate your findings to help others train their models.

I'm going to try your shortcut approach on training Slovak (my language) model next month and will report how that went.

EDIT: Did you try training a model for a different language? I'm finding that the training code is not prepared for multilingual PL-BERT (which some other people have also discovered), and I'm having trouble adjusting the code to cope with it. I commented with more details in this PL-BERT multilingual repo fork.

Thanks again!

Yes, I trained with Brazilian Portuguese and Italian, and I had a lot of success in the results using PL-BERT. You don't need to change anything in the code, just replace everything in the /Utils/PLBERT/ folder with the multilingual version.

Another thing I noticed is that it's better to have a few audios with perfect cuts than hundreds of audios with cuts that generate noise. So, padding at the beginning and end of the audio is essential. I've been a bit short on time these days, but next week I'll make a Gradio app available to apply the cuts and make this easier. As I mentioned, training each run with audios of the same length gives better results. Fine-tuning the English model for only 5 epochs on 15 minutes of audio, with SLM training active from the start, gave undeniable results, always following the rule of same-length audios with a minimum of 4 seconds.

Thanks a lot for your quick reply. I was asking about PL-BERT because I have both tried to train my own Slovak PL-BERT (Slovak is not included in the multilingual PL-BERT) and tried to use one that was trained by an external company. In both cases, I couldn't get the training code to work with these PL-BERTs. I understand that this is now beyond the original question, but if you have any pointers on how to do this, especially with the PL-BERT I provided a link to, I'd be very grateful.

Also, since token maps are not included with the PL-BERT trained externally, I needed to generate a token map myself. Since I couldn't complete my workflow, would you be able to see if this code actually saves the token map into a file as it should? I'd be very grateful for any input here as well:

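import pickle

# Assumption: `tokenizer` is a Hugging Face tokenizer object that was loaded
# earlier in the session; it is not defined in this snippet.

# Get the vocabulary (a dict mapping token string to integer ID)
vocab = tokenizer.get_vocab()

# Save the vocabulary to a .pkl file
with open("slovak_token_map.pkl", "wb") as f:
    pickle.dump(vocab, f)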
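One way to check that a snippet like the one above really writes a usable file is to load it back and compare. The sketch below does that with a small stand-in dictionary in place of `tokenizer.get_vocab()`, since no tokenizer is loaded here. It only confirms that the pickle round-trips; whether a plain token-to-ID dictionary is the structure the training code expects is a separate question that this does not answer.

```python
import os
import pickle
import tempfile

# Stand-in for tokenizer.get_vocab(); a real vocabulary would be much larger.
vocab = {"<pad>": 0, "a": 1, "b": 2}

# Write to a temporary location so the check leaves no files behind in the repo.
path = os.path.join(tempfile.mkdtemp(), "token_map_check.pkl")
with open(path, "wb") as f:
    pickle.dump(vocab, f)

# Load it back and compare with what was saved.
with open(path, "rb") as f:
    restored = pickle.load(f)

round_trip_ok = restored == vocab
print(round_trip_ok, len(restored))
```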

Sorry to hijack the thread a little but I think this all is still very relevant to creating a multilingual StyleTTS2 model :)

Thanks!

Does anyone have an easy-to-follow Jupyter notebook or web UI?

Not for multilingual but for single language I have created 2 notebooks here: #144

The training notebook is easily adaptable to multilingual training by simply replacing the PL-BERT subfolder in the Utils folder with the multilingual one, or with one that you trained yourself. For example, I used https://huggingface.co/gerulata/slovakbert for the Slovak language.

For a single-language, multi-speaker dataset of approximately 1k hours, primarily consisting of Hindi audiobook recordings, would you recommend training a model from scratch or fine-tuning?

@traderpedroso would you be able to write down a couple of points on how to train my own ASR? I tried to clone your https://github.com/traderpedroso/AuxiliaryASR but in the example Jupyter notebook, there is some metadata.txt file that I don't have, so I couldn't progress - and since I'm still fairly new to all this, I'll be very grateful for any pointers here... I already successfully implemented a Slovak PL-BERT and this is the last step for me to perfect my training :)

https://github.com/yl4579/AuxiliaryASR

I suggest you use the official one. I made some workarounds to make mine work with phonemizer in Brazilian Portuguese. You can usually create your train list and validation list already converted to phonemes, as I was doing for testing custom phonemes. I ended up modifying mine a lot, and I believe it won't be useful in your case. Just to be clear, AuxiliaryASR training will improve pronunciation in the language you train it on, and it's not necessary for English.

However, if you want to use mine, simply modify meldataset.py, changing global_phonemizer = phonemizer.backend.EspeakBackend(language='pt-br', preserve_punctuation=True, with_stress=True) to global_phonemizer = phonemizer.backend.EspeakBackend(language='your language iso', preserve_punctuation=True, with_stress=True). Remember that your dataset must not contain phonemes; it should contain plain text in this format: LJSpeech-1.1/wavs/LJ048-0203.wav|The three officers confirm that their primary concern was crowd and traffic control,|0
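Before starting a long training run, it can help to sanity-check that every line of the list file follows the `wav|text|speaker` layout shown above. The following is a minimal sketch of such a check; the `parse_list_line` helper and its specific rules (a `.wav` suffix, non-empty text, integer speaker ID) are assumptions for illustration, not part of either repository.

```python
from typing import Tuple


def parse_list_line(line: str) -> Tuple[str, str, int]:
    """Split one 'wav|text|speaker' line and validate each field."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    wav, text, speaker = parts
    if not wav.endswith(".wav"):
        raise ValueError(f"first field is not a .wav path: {wav!r}")
    if not text.strip():
        raise ValueError(f"empty text for {wav!r}")
    # int() raises ValueError if the speaker ID is not numeric.
    return wav, text, int(speaker)


example = (
    "LJSpeech-1.1/wavs/LJ048-0203.wav|The three officers confirm that their "
    "primary concern was crowd and traffic control,|0"
)
wav, text, speaker = parse_list_line(example)
print(wav, speaker)
```

A transcript that itself contains a `|` character would be rejected by the field-count check, which is usually the desired outcome because it would otherwise shift the speaker column.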

Thanks for your detailed explanation! It's helpful for me.
I have a simple question, though: when you trained the model with audio of 2 to 4 seconds, did you preprocess the original audio in any way? I mean, did you cut or modify the original data to make the clips?
And in the fine-tuning stage, should I make the data the "same length"?

Hi :)

I'm trying your approach and have a question about the 2-4 seconds training. I'm doing 2nd stage and have config set like this:

diff_epoch: 20 # style diffusion starting epoch (2nd stage)
joint_epoch: 50 # joint training starting epoch (2nd stage)

What I found is that for about the first 25 epochs, the quality of the model was actually really good. But after that, especially once joint training started to kick in, I saw a large degradation in output quality (some letters not pronounced at all, etc.).

So, my question is - what settings did you use for the 2-4s training, both 1st and 2nd stage? If you remember / can disclose this.

EDIT: I understand that the 2-4s training is not really about quality but I'm asking since I'm getting these inconsistent results

Thanks!
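To reason about where in a run a quality change occurs, it may help to spell out which second-stage phases are active at a given epoch under the two settings quoted above. This sketch assumes a phase switches on once the epoch counter reaches the configured value; the `>=` comparison, the zero-based counting, and the phase labels are assumptions, not taken from the training script.

```python
from typing import List


def active_phases(epoch: int, diff_epoch: int = 20, joint_epoch: int = 50) -> List[str]:
    """List the second-stage phases assumed to be active at a given epoch."""
    phases = ["base"]
    if epoch >= diff_epoch:       # style diffusion starting epoch
        phases.append("style_diffusion")
    if epoch >= joint_epoch:      # joint training starting epoch
        phases.append("joint")
    return phases


for epoch in (10, 25, 50):
    print(epoch, active_phases(epoch))
```

With diff_epoch: 20 and joint_epoch: 50, a change noticed around epoch 25 would fall shortly after style diffusion starts and well before joint training, which may be useful when deciding which checkpoint to compare against.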

OK, so I tried this approach with a well-prepared custom English dataset but the results were not very good. While the humanization worked reasonably well, the quality wasn't too good. There were quite a few artifacts when I finetuned the 2-4s model.

I then tried to continue training the 2-4s model but with 7-9s data. The quality improved a lot (much more than when I tried to finetune). However, I was still hearing artifacts in the audio (but I've only trained with 7-9s wavs for like 17 epochs, so maybe it would go away eventually).