TurnVoice

A command-line tool (currently in pre-alpha) to transform voices in YouTube videos with additional translation capabilities.

New Features

  • Voice replacement: Strips out the vocal track and recomposes it to preserve the original background audio
  • Speaker diarization: Replace the voice of a specific speaker in a video

Prerequisites

  • Rubberband¹
  • Deezer's Spleeter²
  • Hugging Face access token³

Tip

  • For Deezer's Spleeter CLI, install Python 3.8, then run pipx install spleeter --python /path/to/python3.8 (install pipx first with pip install pipx)
  • Set your Hugging Face access token with setx HF_ACCESS_TOKEN "your_token_here"
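
Putting the tip together in one session (Windows shown for setx, which takes effect in new terminals; /path/to/python3.8 is a placeholder for your local Python 3.8 binary):

pip install pipx
pipx install spleeter --python /path/to/python3.8
setx HF_ACCESS_TOKEN "your_token_here"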

Installation

pip install turnvoice

Tip

For faster rendering with a GPU, prepare your CUDA environment after installation:

For CUDA 11.8
pip install torch==2.1.1+cu118 torchaudio==2.1.1+cu118 --index-url https://download.pytorch.org/whl/cu118

For CUDA 12.1
pip install torch==2.1.1+cu121 torchaudio==2.1.1+cu121 --index-url https://download.pytorch.org/whl/cu121
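
To verify afterwards that PyTorch sees your GPU (uses only the standard torch API):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"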

Usage

turnvoice [-i] <YouTube URL|ID> [-l] <Translation Language> -v <Voice File> -o <Output File>

Example Command:

Arthur Morgan narrating a cooking tutorial:

turnvoice AmC9SmCBUj4 -v arthur.wav -o cooking_with_arthur.mp4
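
To translate the narration as well, add the -l flag with one of the supported language codes (German shown here; the output file name is just an example):

turnvoice AmC9SmCBUj4 -l de -v arthur.wav -o cooking_with_arthur_de.mp4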

Note

This example needs an arthur.wav (or arthur.json) file in the same directory. It works when executed from the tests directory.

Parameters Explained:

  • -i, --in: (required) The YouTube video ID or URL you want to transform
  • -l, --language: Language to translate to (supported: en, es, fr, de, it, pt, pl, tr, ru, nl, cs, ar, zh, ja, hu, ko); leaving this out keeps the source video's language
  • -v, --voice: Your chosen voice in wav format (24kHz, 16 bit, mono, ~10-30s)
  • -o, --output_video: The grand finale video file name (default: 'final_cut.mp4')
  • -a, --analysis: Performs speaker analysis: generates the speaker diarization but does not render the video.
  • -s, --speaker: Number of the speaker to be turned. Speakers are sorted by amount of speech; run --analysis first.
  • -smax, --speaker_max: Maximum number of speakers in the video. Set to 2 or 3 for better results in multi-speaker scenarios.
  • -from, --from: Time to start processing the video from
  • -to, --to: Time to stop processing the video at
  • -dd, --download_directory: Where to save the video downloads (default: 'downloads')
  • -sd, --synthesis_directory: Where to save the text to speech audio files (default: 'synthesis')
  • -e, --extractoff: Disables extracting the audio directly from the video (may lead to higher quality while also increasing the likelihood of errors)
  • -c, --clean_audio: Does not preserve the original audio in the final video; returns the clean synthesis instead

You can leave out the -i and -l flags when the video URL/ID and the translation language are passed as the first two parameters.
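
For example, with that shorthand both of the following commands render the same Spanish version (arthur.wav as the reference voice):

turnvoice -i AmC9SmCBUj4 -l es -v arthur.wav
turnvoice AmC9SmCBUj4 es -v arthur.wav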

What to expect

  • might not always achieve perfect lip synchronization, especially when translating to a different language
  • the translation feature is currently an experimental prototype (powered by Meta's nllb-200-distilled-600m) and still produces rather imperfect results
  • occasionally, the synthesis may introduce unexpected noises or distortions into the audio (artifact reduction improved considerably with the new v0.0.30 algorithm)

Source Quality

  • delivers best results with YouTube videos featuring clear spoken content (podcasts, educational videos)
  • requires a high-quality, clean source WAV file for effective voice cloning

Pro Tips

How to exchange a single speaker

First perform a speaker analysis with the -a parameter:

turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM -a

Then select a speaker from the list with the -s parameter:

turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM -s 2
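
Both steps can be combined with the other flags, for example capping the expected speaker count and supplying a reference voice (the output file name is just an example):

turnvoice https://www.youtube.com/watch?v=2N3PsXPdkmM -smax 3 -s 2 -v arthur.wav -o speaker2_turned.mp4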

The Art of Choosing a Reference Wav

  • A 24000, 44100, or 22050 Hz, 16-bit mono wav file of 10-30 seconds is your golden ticket.
  • 24 kHz, 16-bit mono is my default, but some voices yielded the best results at 44100 Hz, 32-bit.
  • I test voices with this tool before rendering.
  • Audacity is your friend for adjusting sample rates. Experiment with sample rates for the best results!
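
A quick way to check a candidate file against these recommendations, using Python's standard wave module (PCM wav files only; arthur.wav is a placeholder):

python -c "import wave; w = wave.open('arthur.wav'); print(w.getframerate(), 'Hz,', w.getsampwidth() * 8, 'bit,', w.getnchannels(), 'channel(s)')"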

Fixed TTS Model Download Folder

Keep your models organized! Set COQUI_MODEL_PATH to your preferred folder.

Windows example:

setx COQUI_MODEL_PATH "C:\Downloads\CoquiModels"
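
Linux/macOS equivalent (the path is an example; add the line to your shell profile to persist it):

export COQUI_MODEL_PATH="/home/user/CoquiModels"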

Future Improvements

  • TTS voice variety: Add OpenAI TTS, Azure, and Elevenlabs as voice sources.
  • Translation quality: Add an option to translate with OpenAI, the DeepL API, or other models, with better logic than simply transcribing the fragments.
  • Voice cloning from YouTube: Clone voices directly from other videos.
  • Speed up to realtime: Feed in streams and get a "realtime" (translated) stream with the voice of your choice.
  • Open up the CLI: Allow local videos, audios, and even text files as input, all the way down to turnvoice "Hello World".

License

TurnVoice is proudly under the Coqui Public Model License 1.0.0 and the NLLB-200 CC-BY-NC license (both are open-source, non-commercial licenses).

Let's Make It Fun! 🎉

Share your funniest or most creative TurnVoice creations with me!

And if you've got a cool feature idea or just want to say hi, drop me a line!

If you like the repo please leave a star ✨ 🌟 ✨

Footnotes

  1. Rubberband is needed for pitch-preserving time-stretching of audio, to fit the synthesis into its time window ↩

  2. Deezer's Spleeter is needed to split vocals for original audio preservation ↩

  3. A Hugging Face access token is needed to download the speaker diarization model for identifying speakers with pyannote.audio ↩