jhudsl/text2speech

Voice Cloning


Ask the user for two inputs: the text to synthesize and a sample of the user's voice (about 30 seconds of audio).
Then, run both through Coqui XTTS to clone the voice.

Sample Python code:

from TTS.api import TTS

# Load the multilingual XTTS v2 model (downloads to the local cache on first use)
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=False)

# generate speech by cloning a voice using default settings
tts.tts_to_file(text="The mission of the Data Science Lab to ensure an effective data ecosystem at Fred Hutch by developing a modern, well documented, well implemented, overall data strategy that evolves with the needs and capabilities of those leveraging data at Fred Hutch regardless of “where they live” from the clinic to the research groups.",
                 max_new_tokens = 600, # enable long text
                file_path="output-indian-girl-3.wav",
                speaker_wav="indian-girl.wav",
                language="en")

Corresponding R code (with reticulate):

library(reticulate)

# Specify Python
use_python("/opt/homebrew/Caskroom/miniforge/base/bin/python")

# [Python] from TTS.api import TTS
TTS_api <- import("TTS.api")
# [Python] tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2", gpu=False)
tts <- TTS_api$TTS(model_name = "tts_models/multilingual/multi-dataset/xtts_v2", gpu = FALSE)

# [Python] tts.tts_to_file(text="It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
#                          file_path="output-howard.wav",
#                          speaker_wav="howard.wav",
#                          language="en")
tts$tts_to_file(text = "The mission of the Data Science Lab to ensure an effective data ecosystem at Fred Hutch by developing a modern, well documented, well implemented, overall data strategy that evolves with the needs and capabilities of those leveraging data at Fred Hutch regardless of “where they live” from the clinic to the research groups.", max_new_tokens = 600, file_path = "output-indian-girl-2.wav", speaker_wav = "indian-girl.wav", language = "en")
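
Putting the two inputs together, the feature could be exposed as a single wrapper function. A sketch, assuming reticulate is already configured as above; the name tts_clone_voice() and its signature are hypothetical, not part of the current text2speech API:

# Hypothetical wrapper (not yet part of text2speech)
tts_clone_voice <- function(text, speaker_wav, output_path = "output.wav",
                            language = "en", gpu = FALSE) {
  TTS_api <- reticulate::import("TTS.api")
  tts <- TTS_api$TTS(model_name = "tts_models/multilingual/multi-dataset/xtts_v2",
                     gpu = gpu)
  tts$tts_to_file(text = text,
                  max_new_tokens = 600,  # enable long text
                  file_path = output_path,
                  speaker_wav = speaker_wav,  # the user's ~30 sec voice sample
                  language = language)
  invisible(output_path)
}

# Example call:
# tts_clone_voice("Hello from the Data Science Lab.",
#                 speaker_wav = "indian-girl.wav",
#                 output_path = "cloned-output.wav")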