Check out our demo video here!
- Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.
- Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.
- Cross-lingual Support: Inference in languages different from the training dataset; English, Japanese, and Chinese are currently supported.
- WebUI Tools: Integrated tools include vocal/accompaniment separation, automatic training-set segmentation, Chinese ASR, and text labeling, helping beginners create training datasets and GPT/SoVITS models.
If you are a Windows user (tested on Windows 10 and later), you can install directly via the prezip: download it, unzip it, and double-click go-webui.bat to start GPT-SoVITS-WebUI.
Tested with Python 3.9, PyTorch 2.0.1, and CUDA 11.

```bash
conda create -n GPTSoVits python=3.9
conda activate GPTSoVits
bash install.sh
```

Alternatively, install the dependencies manually:

```bash
pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm cn2an pypinyin pyopenjtalk g2p_en chardet
```
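Once the dependencies are installed, a quick way to confirm the environment is to probe for each package with the standard library's `importlib`. This is a minimal sketch, not part of GPT-SoVITS: `missing_packages` is a hypothetical helper, and the list below covers only a subset of the pip line above.

```python
# Sketch: report which required packages are not importable in the
# current environment before launching the WebUI.
import importlib.util

def missing_packages(packages):
    """Return the subset of `packages` with no importable top-level module."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

# Note: some distribution names differ from their import names
# (e.g. pytorch-lightning installs as pytorch_lightning), so list
# import names here.
REQUIRED = ["torch", "numpy", "scipy", "librosa", "gradio", "tqdm"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```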
If you need Chinese ASR (supported by FunASR), install:

```bash
pip install modelscope torchaudio sentencepiece funasr
```
Install FFmpeg for your platform:

```bash
# Conda
conda install ffmpeg

# Ubuntu/Debian
sudo apt install ffmpeg
sudo apt install libsox-dev

# Conda (conda-forge, pinned below FFmpeg 7)
conda install -c conda-forge 'ffmpeg<7'

# macOS (Homebrew)
brew install ffmpeg
```
Windows users should download ffmpeg.exe and ffprobe.exe and place them in the GPT-SoVITS root.
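To confirm FFmpeg is actually reachable (from PATH, or from the repo root on Windows as described above), a small check like the following can help. `find_tool` is a hypothetical helper for illustration, not part of the project:

```python
# Sketch: locate ffmpeg/ffprobe either on PATH or as .exe files
# dropped into the repo root (the Windows prezip layout).
import shutil
from pathlib import Path

def find_tool(name, repo_root="."):
    """Return a usable path for `name`, or None if it cannot be found."""
    on_path = shutil.which(name)             # searches PATH
    if on_path:
        return on_path
    local = Path(repo_root) / f"{name}.exe"  # repo-root .exe on Windows
    return str(local) if local.is_file() else None

if __name__ == "__main__":
    for tool in ("ffmpeg", "ffprobe"):
        print(tool, "->", find_tool(tool) or "NOT FOUND")
```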
Download pretrained models from GPT-SoVITS Models and place them in GPT_SoVITS\pretrained_models.
For Chinese ASR (additionally), download models from Damo ASR Model, Damo VAD Model, and Damo Punc Model and place them in tools/damo_asr/models.
For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal, additionally), download models from UVR5 Weights and place them in tools/uvr5/uvr5_weights.
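Before launching training or the optional tools, it can help to sanity-check that the weight directories above exist and are non-empty. A minimal sketch, assuming it is run from the repo root; `check_dirs` and the expected-directory list are illustrative, not part of the project:

```python
# Sketch: verify that the model directories described above exist
# and contain at least one file.
from pathlib import Path

EXPECTED_DIRS = [
    "GPT_SoVITS/pretrained_models",  # base GPT-SoVITS weights
    "tools/damo_asr/models",         # optional: Damo ASR/VAD/Punc models
    "tools/uvr5/uvr5_weights",       # optional: UVR5 separation weights
]

def check_dirs(root="."):
    """Map each expected directory to True if it exists and is non-empty."""
    status = {}
    for rel in EXPECTED_DIRS:
        d = Path(root) / rel
        status[rel] = d.is_dir() and any(d.iterdir())
    return status

if __name__ == "__main__":
    for rel, ok in check_dirs().items():
        print(("OK   " if ok else "MISSING ") + rel)
```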
The TTS annotation .list file format:

```
vocal_path|speaker_name|language|text
```

Language dictionary:

- 'zh': Chinese
- 'ja': Japanese
- 'en': English

Example:

```
D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
```
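Because the text field may itself contain `|`, a line in this format should be split on only the first three `|` characters. The sketch below shows one way to validate such a line; `parse_list_line` is a hypothetical helper, not a GPT-SoVITS API:

```python
# Sketch: parse and validate one line of the annotation .list format.
LANGUAGES = {"zh", "ja", "en"}  # codes from the language dictionary above

def parse_list_line(line):
    """Split a `vocal_path|speaker_name|language|text` line into a dict."""
    parts = line.rstrip("\n").split("|", 3)  # at most 3 splits: text may contain '|'
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}")
    vocal_path, speaker, lang, text = parts
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {lang!r}")
    return {"vocal_path": vocal_path, "speaker_name": speaker,
            "language": lang, "text": text}
```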
- High Priority:
  - Localization in Japanese and English.
  - User guide.
  - Fine-tune training on Japanese and English datasets.
- Features:
  - Zero-shot voice conversion (5s) / few-shot voice conversion (1min).
  - TTS speaking-speed control.
  - Enhanced TTS emotion control.
  - Experiment with changing SoVITS token inputs to a probability distribution over the vocabulary.
  - Improve the English and Japanese text frontends.
  - Develop tiny and larger-sized TTS models.
  - Colab scripts.
  - Expand the training dataset (2k -> 10k).
  - Better SoVITS base model (enhanced audio quality).
  - Model mixing.
Special thanks to the following projects and contributors:
- ar-vits
- SoundStorm
- vits
- TransferTTS
- Chinese Speech Pretrain
- contentvec
- hifi-gan
- Chinese-Roberta-WWM-Ext-Large
- fish-speech
- ultimatevocalremovergui
- audio-slicer
- SubFix
- FFmpeg
- gradio