DigitalPhonetics/IMS-Toucan

Clone a Voice. How to improve?

vikolaz opened this issue · 9 comments

import os
import torch
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface

if __name__ == '__main__':
    tts = ToucanTTSInterface(device="cuda" if torch.cuda.is_available() else "cpu", tts_model_path="Meta", language="it")

    input_text = "my text to say"

    # Loop through the speaker reference audio files in the folder
    speaker_reference_folder = "input/folder"
    for file_name in os.listdir(speaker_reference_folder):
        if file_name.endswith('.wav'):
            speaker_reference = os.path.join(speaker_reference_folder, file_name)

            # Set the speaker embedding to clone the voice
            tts.set_utterance_embedding(speaker_reference)

            # Synthesize speech with the cloned voice
            output_file_name = "audios/cloned_voice.wav"
            tts.read_to_file(text_list=[input_text], file_location=output_file_name)

    del tts

I used this method to clone a voice. The result is somewhat similar to the original voice, but I guess it can be improved.

Is there a different/better approach to do this?
How big should my dataset be?

So far I've used about 7 samples of 1 minute each.

Thanks

Actually, it looks like it only takes one file into consideration, even if I give it more as input.
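
One likely cause is in the loop itself: every iteration writes to the same output path ("audios/cloned_voice.wav"), so each synthesis overwrites the previous one and only the result for the last reference file survives. Here is a minimal sketch of a fix, assuming the same folder layout and interface as in the snippet above, that derives a separate output name from each reference file:

import os
import torch
from InferenceInterfaces.ToucanTTSInterface import ToucanTTSInterface

tts = ToucanTTSInterface(device="cuda" if torch.cuda.is_available() else "cpu", tts_model_path="Meta", language="it")
input_text = "my text to say"
speaker_reference_folder = "input/folder"

for file_name in os.listdir(speaker_reference_folder):
    if file_name.endswith('.wav'):
        speaker_reference = os.path.join(speaker_reference_folder, file_name)
        tts.set_utterance_embedding(speaker_reference)

        # one output file per reference, e.g. "audios/cloned_voice_sample01.wav"
        output_file_name = os.path.join("audios", "cloned_voice_" + os.path.splitext(file_name)[0] + ".wav")
        tts.read_to_file(text_list=[input_text], file_location=output_file_name)

This way you get one output per reference clip and can compare which one clones the voice best.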

If set_utterance_embedding does not give you satisfactory results, I think you have to fine-tune the model (Meta) properly on your dataset.

If you only want to use a speaker reference, 6 to 12 seconds is enough (a single one per speaker, as you already guessed).

Do you know of any specific guides or tutorials that provide step-by-step instructions on how to perform fine-tuning with ToucanTTS?

I'm a beginner at coding :)

I am currently writing one, but it is in French and still not finished, since it is one of my many side projects!

Yet if you carefully follow this project's README "quietly" 😉 https://github.com/DigitalPhonetics/IMS-Toucan#build-a-toucantts-pipeline you will be able to fine-tune the Meta model on your dataset (you'll probably need more data than 7 minutes of audio). The instructions are really sufficient, and there are already examples in the files to be modified to guide you.

Please keep in mind that you need to create your dataset with a transcription of each audio sample (10 seconds max). Ask ChatGPT how to generate a dataset for TTS training; it will give you advice if you need some. You'll also need a GPU for the training.
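
To make the dataset part more concrete: the training recipes in the repository work from a mapping of audio file paths to their transcripts. Below is a rough sketch of such a function; the folder name my_dataset/, the wavs/ subfolder, and the pipe-separated metadata.csv layout are only assumptions for illustration, so adapt them to however you actually store your recordings and transcriptions.

import os

def build_path_to_transcript_dict_my_dataset(root="my_dataset"):
    # expects one line per sample in metadata.csv, e.g.:
    #   sample_001.wav|This is the spoken sentence.
    path_to_transcript = dict()
    with open(os.path.join(root, "metadata.csv"), encoding="utf8") as f:
        for line in f:
            if line.strip() == "":
                continue
            file_name, transcript = line.strip().split("|", 1)
            path_to_transcript[os.path.join(root, "wavs", file_name)] = transcript
    return path_to_transcript

Each key should point to a clip of at most about 10 seconds, as mentioned above; the pipeline recipe you copy when following the README can then use a dict like this when building its training corpus.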

@vikolaz I have not officially published this yet, but:

  • it's a good example of specifically what's required
  • it includes a training set, ish
  • you can ignore the OpenShift/Jupyter parts. All the code/content would work locally on a system with Python and CUDA

https://github.com/OpenShiftDemos/ToucanTTS-RHODS-voice-cloning

A small update on this: Zero-shot voice cloning is being worked on right now. It does not sound good yet and I've already put multiple months into this. But hopefully with the next version, the model can be used to speak in an unseen voice much better even without finetuning and everything will be a bit simpler.

Is it fixed in new release?

In the new release, voice cloning is definitely much better than it was, but there is still plenty of room to improve. I'll make an English-Only checkpoint in the next few weeks that's going to be focussed on speaker adaptation.

Can you also share training info afterwards? I would like to train for other languages (voice cloning).