VALL-E X: Multilingual Text-to-Speech Synthesis and Voice Cloning 🔊


English | [中文](README-ZH.md)
An open source implementation of Microsoft's [VALL-E X](https://arxiv.org/pdf/2303.03926) zero-shot TTS model.
**We release our trained model to the public for research or application usage.**

vallex-framework

VALL-E X is an amazing multilingual text-to-speech (TTS) model proposed by Microsoft. Microsoft published the research paper but did not release any code or pretrained models. Recognizing the potential and value of this technology, our team took on the challenge of reproducing the results and training our own model. We are glad to share our trained VALL-E X model with the community, allowing everyone to experience the power of next-generation TTS! 🎧

📖 Quick Index

🚀 Updates

2023.09.10

  • Added batch decoding to the AR decoder for more stable generation results.

2023.08.30

  • Replaced the EnCodec decoder with the Vocos decoder, improving audio quality. (Thanks to @v0xie)

2023.08.23

  • Added long text generation.

2023.08.14

  • Pretrained VALL-E X checkpoint is now released. Download it here

💻 Installation

Install with pip. Requirements: Python 3.10, CUDA 11.7 ~ 12.0, PyTorch 2.0+.

```bash
git clone https://github.com/Plachtaa/VALL-E-X.git
cd VALL-E-X
pip install -r requirements.txt
```

Note: If you want to make voice prompts, you need to install ffmpeg and add its folder to the PATH environment variable.

When you run the program for the first time, it will automatically download the corresponding model.

If the download fails and reports an error, please follow the steps below to manually download the model.

(Please pay attention to the capitalization of folders)

  1. Check whether there is a checkpoints folder in the installation directory. If not, manually create a checkpoints folder (./checkpoints/) in the installation directory.

  2. Check whether there is a vallex-checkpoint.pt file in the checkpoints folder. If not, please manually download the vallex-checkpoint.pt file from here and put it in the checkpoints folder.

  3. Check whether there is a whisper folder in the installation directory. If not, manually create a whisper folder (./whisper/) in the installation directory.

  4. Check whether there is a medium.pt file in the whisper folder. If not, please manually download the medium.pt file from here and put it in the whisper folder.
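The four checks above can be scripted. The following is a minimal sketch, not part of the repository: `find_missing_models` is a hypothetical helper that creates the two folders if needed and reports which model files are still missing. It does not download anything.

```python
from pathlib import Path

# Expected layout, taken from the manual-download steps above.
REQUIRED_FILES = [
    Path("checkpoints") / "vallex-checkpoint.pt",
    Path("whisper") / "medium.pt",
]


def find_missing_models(install_dir="."):
    """Create the model folders if needed and return the files still missing."""
    root = Path(install_dir)
    missing = []
    for relative_path in REQUIRED_FILES:
        target = root / relative_path
        # Folder names are created exactly as written (lowercase).
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.is_file():
            missing.append(target)
    return missing
```

Calling `find_missing_models()` from the installation directory returns an empty list once both files are in place.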

🎧 Demos

Not ready to set up the environment on your local machine just yet? No problem! We've got you covered with our online demos. You can try out VALL-E X directly on Hugging Face or Google Colab, experiencing the model's capabilities hassle-free!

📢 Features

VALL-E X comes packed with cutting-edge functionalities:

  1. Multilingual TTS: Speak in three languages (English, Chinese, and Japanese) with natural and expressive speech synthesis.

  2. Zero-shot Voice Cloning: Enroll a short 3~10 second recording of an unseen speaker, and watch VALL-E X create personalized, high-quality speech that sounds just like them!
     See example: prompt.webm, output.webm

  3. Speech Emotion Control: Experience the power of emotions! VALL-E X can synthesize speech with the same emotion as the acoustic prompt provided, adding an extra layer of expressiveness to your audio.
     See example: sleepy-prompt.mp4, sleepy-output.mp4

  4. Zero-shot Cross-Lingual Speech Synthesis: Take monolingual speakers on a linguistic journey! VALL-E X can produce personalized speech in another language without compromising on fluency or accent. In the example below, a Japanese speaker talks in Chinese and English. 🇯🇵 🗣
     See example: jp-prompt.webm, en-output.webm, zh-output.webm

  5. Accent Control: Get creative with accents! VALL-E X allows you to experiment with different accents, like speaking Chinese with an English accent or vice versa. 🇨🇳 💬
     See example: en-prompt.webm, zh-accent-output.webm, en-accent-output.webm

  6. Acoustic Environment Maintenance: No need for perfectly clean audio prompts! VALL-E X adapts to the acoustic environment of the input, making speech generation feel natural and immersive.
     See example: noise-prompt.webm, noise-output.webm

Explore our demo page for a lot more examples!

🐍 Usage in Python

🪑 Basics

```python
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(text_prompt)

# save audio to disk
write_wav("vallex_generation.wav", SAMPLE_RATE, audio_array)

# play audio in notebook
Audio(audio_array, rate=SAMPLE_RATE)
```

hamburger.webm

🌎 Foreign Language


This VALL-E X implementation also supports Chinese and Japanese. All three languages have equally awesome performance!

```python
# Japanese input. In English: "Chuseok is my favorite festival.
# I can take a few days off and spend time with friends and family."
text_prompt = """
    チュソクは私のお気に入りの祭りです。 私は数日間休んで、友人や家族との時間を過ごすことができます。
"""
audio_array = generate_audio(text_prompt)
```

vallex_japanese.webm

Note: VALL-E X controls accent perfectly even when synthesizing code-switched text. However, you need to manually mark the language of each sentence, since our g2p tool is rule-based.

```python
text_prompt = """
    [EN]The Thirty Years' War was a devastating conflict that had a profound impact on Europe.[EN]
    [ZH]这是历史的开始。 如果您想听更多，请继续。[ZH]
"""
audio_array = generate_audio(text_prompt, language='mix')
```

vallex_codeswitch.webm
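If you build code-switched prompts programmatically, tagging by hand gets tedious. Here is a small sketch of one way to do it. `tag_segments` is a hypothetical helper, not part of this repository, and it only assumes the `[EN]...[EN]` / `[ZH]...[ZH]` marker format shown in the example above.

```python
def tag_segments(segments):
    """Wrap each (language, text) pair in matching [LANG]...[LANG] markers."""
    parts = []
    for lang, text in segments:
        tag = lang.upper()
        parts.append(f"[{tag}]{text.strip()}[{tag}]")
    return " ".join(parts)


text_prompt = tag_segments([
    ("en", "The Thirty Years' War was a devastating conflict."),
    ("zh", "这是历史的开始。"),
])
print(text_prompt)
```

The resulting string can then be passed to `generate_audio(text_prompt, language='mix')` as in the example above.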

📼 Voice Presets

VALL-E X provides dozens of speaker voices that you can use directly for inference! Browse all voices in the code

VALL-E X tries to match the tone, pitch, emotion and prosody of a given preset. The model also attempts to preserve music, ambient noise, etc.

```python
text_prompt = """
I am an innocent boy with a smoky voice. It is a great honor for me to speak at the United Nations today.
"""
audio_array = generate_audio(text_prompt, prompt="dingzhen")
```

smoky.webm

🎙 Voice Cloning

VALL-E X supports voice cloning! You can make a voice prompt with any person, character or even your own voice, and use it like other voice presets.
To make a voice prompt, you need to provide a speech recording 3~10 seconds long, as well as its transcript. You can also leave the transcript blank and let the Whisper model generate it.

VALL-E X tries to match the tone, pitch, emotion and prosody of a given prompt. The model also attempts to preserve music, ambient noise, etc.
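Before making a prompt, it can help to confirm that the recording falls inside the 3~10 second range. The sketch below uses only Python's standard `wave` module and assumes an uncompressed PCM WAV file. Both function names are hypothetical and not part of this repository.

```python
import wave

MIN_PROMPT_SECONDS = 3.0   # lower bound stated above
MAX_PROMPT_SECONDS = 10.0  # upper bound stated above


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_prompt_length(seconds):
    """Check that a recording length is within the recommended prompt range."""
    return MIN_PROMPT_SECONDS <= seconds <= MAX_PROMPT_SECONDS
```

For example, `is_valid_prompt_length(wav_duration_seconds("paimon_prompt.wav"))` tells you whether the file is a suitable length before you call `make_prompt`.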

```python
from utils.prompt_making import make_prompt

# Use given transcript
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav",
            transcript="Just, what was that? Paimon thought we were gonna get eaten.")

# Alternatively, use whisper
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav")
```

Now let's try out the prompt we've just made!

```python
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# download and load all models
preload_models()

text_prompt = """
Hey, Traveler, Listen to this, This machine has taken my voice, and now it can talk just like me!
"""
audio_array = generate_audio(text_prompt, prompt="paimon")

write_wav("paimon_cloned.wav", SAMPLE_RATE, audio_array)
```

paimon_prompt.webm
paimon_cloned.webm

🎢 User Interface

Not comfortable with code? No problem! We've also created a user-friendly graphical interface for VALL-E X. It allows you to interact with the model effortlessly, making voice cloning and multilingual speech synthesis a breeze.
You can launch the UI with the following command:

```bash
python -X utf8 launch-ui.py
```

🛠️ Hardware and Inference Speed

VALL-E X works well on both CPU and GPU (PyTorch 2.0+, CUDA 11.7 and CUDA 12.0).

A GPU VRAM of 6GB is enough for running VALL-E X without offloading.
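To check your own machine against the 6GB figure, something like the sketch below may help. It is only an illustration: `has_enough_vram` and `pick_device` are hypothetical helpers, and falling back to CPU on smaller GPUs is this sketch's choice rather than something the project prescribes. The threshold uses decimal gigabytes because drivers usually report slightly less than a card's nominal size.

```python
REQUIRED_VRAM_GB = 6  # figure quoted above


def has_enough_vram(total_bytes, required_gb=REQUIRED_VRAM_GB):
    """Compare reported GPU memory against the requirement (decimal GB)."""
    return total_bytes >= required_gb * 10**9


def pick_device():
    """Return "cuda" if a large-enough GPU is visible to PyTorch, else "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so nothing can run on a GPU anyway.
        return "cpu"
    if torch.cuda.is_available():
        total = torch.cuda.get_device_properties(0).total_memory
        if has_enough_vram(total):
            return "cuda"
    return "cpu"


print(pick_device())
```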

⚙️ Details

VALL-E X is similar to Bark, VALL-E and AudioLM, which generate audio in a GPT-style manner by predicting audio tokens quantized by EnCodec.
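As a rough picture of what "GPT-style" means here, the toy loop below appends one discrete audio token at a time until an end marker appears. It is purely illustrative: `toy_next_token` is a made-up stand-in for a trained decoder, and the codebook size of 1024 and the end-of-sequence id are assumptions for this sketch, not values read from this repository.

```python
CODEBOOK_SIZE = 1024       # assumed number of entries in one audio codebook
EOS_TOKEN = CODEBOOK_SIZE  # hypothetical end-of-sequence id, one past the codebook


def toy_next_token(tokens):
    """Stand-in for a trained decoder: maps a token prefix to the next token."""
    if len(tokens) >= 12:
        return EOS_TOKEN
    return (sum(tokens) * 31 + len(tokens)) % CODEBOOK_SIZE


def generate_tokens(prompt_tokens, max_new_tokens=50):
    """Autoregressive decoding: each new token depends on all previous ones."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        if next_token == EOS_TOKEN:
            break
        tokens.append(next_token)
    return tokens


audio_tokens = generate_tokens([5, 17, 256])
print(len(audio_tokens))
```

In a real system the resulting token sequence would be handed to a codec decoder to reconstruct a waveform.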
Compared to Bark:

  • ✔ Lightweight: 3️⃣ ✖ smaller
  • ✔ Efficient: 4️⃣ ✖ faster
  • ✔ Better quality on Chinese & Japanese
  • ✔ Cross-lingual speech without foreign accent
  • ✔ Easy voice-cloning
  • ❌ Fewer languages
  • ❌ No special tokens for music / sound effects

Supported Languages

| Language | Status |
| --- | --- |
| English (en) | ✅ |
| Japanese (ja) | ✅ |
| Chinese, simplified (zh) | ✅ |

❓ FAQ

Where is the code for training?

  • lifeiteng's vall-e has almost everything. There is no plan to release our training code because it does not differ from lifeiteng's implementation.

How much VRAM do I need?

  • 6GB GPU VRAM - Almost all NVIDIA GPUs satisfy the requirement.

Why does the model fail to generate long text?

  • A Transformer's computational complexity grows quadratically with sequence length. Hence, all training samples are kept under 22 seconds. Please make sure the total length of the audio prompt and the generated audio is less than 22 seconds to ensure acceptable performance.
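The 22-second budget and the quadratic growth can be made concrete with a little arithmetic. The helpers below are a hypothetical sketch, not part of this repository. The first checks whether a prompt plus the expected output fits the budget, and the second shows how attention cost scales relative to a 22-second sequence.

```python
MAX_TOTAL_SECONDS = 22.0  # training-length limit stated above


def fits_training_window(prompt_seconds, generated_seconds, limit=MAX_TOTAL_SECONDS):
    """True if the prompt plus the generated audio stays under the limit."""
    return prompt_seconds + generated_seconds < limit


def relative_attention_cost(seconds, reference_seconds=MAX_TOTAL_SECONDS):
    """Quadratic scaling: cost of a sequence relative to the reference length."""
    return (seconds / reference_seconds) ** 2
```

For instance, a 5-second prompt leaves room for a little under 17 seconds of generated audio, and a sequence twice the reference length costs four times as much attention computation.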

MORE TO BE ADDED...

🙏 Appreciation

📜 License

VALL-E X is licensed under the MIT License.


Happy voice cloning! 🎤