GPT-SoVITS-WebUI

A Powerful Few-shot Voice Conversion and Text-to-Speech WebUI.

Check out our demo video here!

Features:

Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.
Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.
Cross-lingual Support: Run inference in languages other than those in the training dataset. English, Japanese, and Chinese are currently supported.
WebUI Tools: Integrated tools include voice accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, assisting beginners in creating training datasets and GPT/SoVITS models.

Environment Preparation

Windows users (tested on Windows 10 and later) can install directly from the prezip: download it, unzip it, and double-click go-webui.bat to start GPT-SoVITS-WebUI.

Tested Environments

Python 3.9, PyTorch 2.0.1, CUDA 11
Python 3.10.13, PyTorch 2.1.2, CUDA 12.3

Note: numba==0.56.4 requires Python < 3.11.
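
The version constraint in the note above can be checked before installing anything. The following is a minimal sketch using only the Python standard library; the helper name `python_supports_numba_0564` is hypothetical and is not part of GPT-SoVITS.

```python
import sys


def python_supports_numba_0564(version_info=sys.version_info):
    """Return True if the interpreter is older than Python 3.11,
    which is the limit the note gives for numba==0.56.4."""
    return tuple(version_info[:2]) < (3, 11)


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    if python_supports_numba_0564():
        print(f"Python {major}.{minor} is within the range stated for numba==0.56.4.")
    else:
        # The tested environments listed above use Python 3.9 and 3.10.
        print(f"Python {major}.{minor} is too new for numba==0.56.4.")
```

Running this inside the activated environment before `pip install` gives an early warning instead of a build failure partway through installation.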

Quick Install with Conda

conda create -n GPTSoVits python=3.9
conda activate GPTSoVits
bash install.sh

Install Manually

Make sure distutils for Python 3.9 is installed:

sudo apt-get install python3.9-distutils

Pip Packages

pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm cn2an pypinyin pyopenjtalk g2p_en chardet

Additional Requirements

If you need Chinese ASR (supported by FunASR), install:

pip install modelscope torchaudio sentencepiece funasr

FFmpeg

Conda Users

conda install ffmpeg

Ubuntu/Debian Users

sudo apt install ffmpeg
sudo apt install libsox-dev
conda install -c conda-forge 'ffmpeg<7'

MacOS Users

brew install ffmpeg

Windows Users

Download and place ffmpeg.exe and ffprobe.exe in the GPT-SoVITS root.
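
Whichever platform route is used, it can help to confirm that both executables are actually reachable. The sketch below is an illustration only: `find_tool` is a hypothetical helper, and the assumption that the project root is the current directory may not match your setup. It looks on PATH first, then in the root directory, which mirrors the Windows instruction above.

```python
import shutil
from pathlib import Path


def find_tool(name, root="."):
    """Locate an executable on PATH, or directly inside `root`
    (the layout described for Windows users). Returns a path string or None."""
    on_path = shutil.which(name)
    if on_path:
        return on_path
    for candidate in (Path(root) / name, Path(root) / f"{name}.exe"):
        if candidate.is_file():
            return str(candidate)
    return None


if __name__ == "__main__":
    for tool in ("ffmpeg", "ffprobe"):
        location = find_tool(tool)
        print(f"{tool}: {location or 'not found'}")
```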

Pretrained Models

Download pretrained models from GPT-SoVITS Models and place them in GPT_SoVITS/pretrained_models.

For Chinese ASR (optional), also download models from Damo ASR Model, Damo VAD Model, and Damo Punc Model and place them in tools/damo_asr/models.

For UVR5 (optional; vocals/accompaniment separation and reverberation removal), also download models from UVR5 Weights and place them in tools/uvr5/uvr5_weights.
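
A quick way to confirm the downloads landed in the right place is to check that each target directory exists and is not empty. This is a rough sketch: the directory paths come from the text above, while the labels, the `MODEL_DIRS` table, and the `missing_model_dirs` helper are hypothetical. It does not verify which model files are present, only that something is there.

```python
from pathlib import Path

# Directories named in this section; "asr" and "uvr5" are only needed
# when those features are used.
MODEL_DIRS = {
    "pretrained": "GPT_SoVITS/pretrained_models",
    "asr": "tools/damo_asr/models",
    "uvr5": "tools/uvr5/uvr5_weights",
}


def missing_model_dirs(root=".", wanted=("pretrained",)):
    """Return the labels whose directory is absent or empty under `root`."""
    missing = []
    for label in wanted:
        directory = Path(root) / MODEL_DIRS[label]
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(label)
    return missing


if __name__ == "__main__":
    problems = missing_model_dirs(wanted=("pretrained", "asr", "uvr5"))
    print("Missing or empty:", ", ".join(problems) if problems else "none")
```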

Dataset Format

The TTS annotation .list file format:

vocal_path|speaker_name|language|text

Language dictionary:

'zh': Chinese
'ja': Japanese
'en': English

Example:

D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
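
To make the format concrete, here is a small sketch of how one line of a .list file could be parsed and validated. It is not the parser GPT-SoVITS itself uses; the `Annotation` type and `parse_list_line` function are hypothetical names, and splitting on only the first three `|` characters (so that a `|` inside the text is preserved) is an assumption rather than documented behavior.

```python
from typing import NamedTuple

# Language codes from the dictionary above.
LANGUAGES = {"zh": "Chinese", "ja": "Japanese", "en": "English"}


class Annotation(NamedTuple):
    vocal_path: str
    speaker_name: str
    language: str
    text: str


def parse_list_line(line):
    """Parse one `vocal_path|speaker_name|language|text` line."""
    # Split on the first three separators only; the remainder is the text.
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    annotation = Annotation(*parts)
    if annotation.language not in LANGUAGES:
        raise ValueError(f"unknown language code: {annotation.language!r}")
    return annotation


if __name__ == "__main__":
    # Raw string so the backslashes in the Windows path are kept literally.
    example = r"D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin."
    print(parse_list_line(example))
```

Running a check like this over every line before training catches missing fields and unsupported language codes early.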

Todo List

Credits

Special thanks to the following projects and contributors:

jingx8885/GPT-SoVITS