Check out our demo video here!
few.shot.fine.tuning.demo.mp4
-
Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.
-
Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.
-
Cross-lingual Support: Inference in languages different from the training dataset, currently supporting English, Japanese, and Chinese.
-
WebUI Tools: Integrated tools include voice accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, assisting beginners in creating training datasets and GPT/SoVITS models.
If you are a Windows user (tested with win>=10) you can install directly via the prezip. Just download the prezip, unzip it and double-click go-webui.bat to start GPT-SoVITS-WebUI.
- Python 3.9, PyTorch 2.0.1, CUDA 11
- Python 3.10.13, PyTorch 2.1.2, CUDA 12.3
Note: numba==0.56.4 require py<3.11
conda create -n GPTSoVits python=3.9
conda activate GPTSoVits
bash install.sh
pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm cn2an pypinyin pyopenjtalk g2p_en chardet transformers jieba_fast
If you need Chinese ASR (supported by FunASR), install:
pip install modelscope torchaudio sentencepiece funasr
conda install ffmpeg
sudo apt install ffmpeg
sudo apt install libsox-dev
conda install -c conda-forge 'ffmpeg<7'
brew install ffmpeg
Download and place ffmpeg.exe and ffprobe.exe in the GPT-SoVITS root.
Download pretrained models from GPT-SoVITS Models and place them in GPT_SoVITS/pretrained_models
.
For Chinese ASR (additionally), download models from Damo ASR Model, Damo VAD Model, and Damo Punc Model and place them in tools/damo_asr/models
.
For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal, additionally), download models from UVR5 Weights and place them in tools/uvr5/uvr5_weights
.
- Environment Variables:
- is_half: Controls half-precision/double-precision. This is typically the cause if the content under the directories 4-cnhubert/5-wav32k is not generated correctly during the "SSL extracting" step. Adjust to True or False based on your actual situation.
- Volumes Configuration,The application's root directory inside the container is set to /workspace. The default docker-compose.yaml lists some practical examples for uploading/downloading content.
- shm_size: The default available memory for Docker Desktop on Windows is too small, which can cause abnormal operations. Adjust according to your own situation.
- Under the deploy section, GPU-related settings should be adjusted cautiously according to your system and actual circumstances.
docker compose -f "docker-compose.yaml" up -d
As above, modify the corresponding parameters based on your actual situation, then run the following command:
docker run --rm -it --gpus=all --env=is_half=False --volume=G:\GPT-SoVITS-DockerTest\output:/workspace/output --volume=G:\GPT-SoVITS-DockerTest\logs:/workspace/logs --volume=G:\GPT-SoVITS-DockerTest\SoVITS_weights:/workspace/SoVITS_weights --workdir=/workspace -p 9870:9870 -p 9871:9871 -p 9872:9872 -p 9873:9873 -p 9874:9874 --shm-size="16G" -d breakstring/gpt-sovits:dev-20240123.03
The TTS annotation .list file format:
vocal_path|speaker_name|language|text
Language dictionary:
- 'zh': Chinese
- 'ja': Japanese
- 'en': English
Example:
D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
-
High Priority:
- Localization in Japanese and English.
- User guide.
- Japanese and English dataset fine tune training.
-
Features:
- Zero-shot voice conversion (5s) / few-shot voice conversion (1min).
- TTS speaking speed control.
- Enhanced TTS emotion control.
- Experiment with changing SoVITS token inputs to probability distribution of vocabs.
- Improve English and Japanese text frontend.
- Develop tiny and larger-sized TTS models.
- Colab scripts.
- Try expand training dataset (2k hours -> 10k hours).
- better sovits base model (enhanced audio quality)
- model mix
Special thanks to the following projects and contributors:
- ar-vits
- SoundStorm
- vits
- TransferTTS
- Chinese Speech Pretrain
- contentvec
- hifi-gan
- Chinese-Roberta-WWM-Ext-Large
- fish-speech
- ultimatevocalremovergui
- audio-slicer
- SubFix
- FFmpeg
- gradio