This is an implementation of MB-iSTFT-VITS
- Supported Language: Korean
- A Windows/Linux system with a minimum of 16GB RAM.
- A GPU with at least 12GB of VRAM.
- Python == 3.8
- Anaconda installed.
- PyTorch installed.
- CUDA 11.x installed.
- Zlib DLL installed.
PyTorch install command:
```sh
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
```
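To confirm that PyTorch was installed with CUDA support, a quick check (a minimal sketch, assuming the install command above succeeded):
```python
import torch

# Should print 1.13.1+cu117 and True when the CUDA build is active.
print(torch.__version__)
print(torch.cuda.is_available())
```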
CUDA 11.7 install: https://developer.nvidia.com/cuda-11-7-0-download-archive

Zlib DLL install: https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-zlib-windows
- Create an Anaconda environment:
```sh
conda create -n vits python=3.8
```
- Activate the environment:
```sh
conda activate vits
```
- Clone this repository to your local machine:
```sh
git clone https://github.com/ORI-Muchim/MB-iSTFT-VITS-Korean.git
```
- Navigate to the cloned directory:
```sh
cd MB-iSTFT-VITS-Korean
```
- Install the necessary dependencies:
```sh
pip install -r requirements.txt
```
"n_speakers" should be 0 in config.json
Filelist format:
```
path/to/XXX.wav|transcript
```
- Example:
```
dataset/001.wav|안녕하세요.
```
Speaker IDs should start from 0:
```
path/to/XXX.wav|speaker id|transcript
```
- Example:
```
dataset/001.wav|0|안녕하세요.
```
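Before preprocessing, it can help to sanity-check the filelists. The helper below is a hypothetical sketch, not part of this repo:
```python
# Hypothetical helper: verify the field count per line and, for multi-speaker
# filelists, that speaker IDs are integers starting from 0.
def check_filelist(path, n_fields):
    speaker_ids = set()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            parts = line.rstrip("\n").split("|")
            assert len(parts) == n_fields, f"line {i}: expected {n_fields} fields"
            if n_fields == 3:  # wav path|speaker id|transcript
                speaker_ids.add(int(parts[1]))
    if speaker_ids:
        assert min(speaker_ids) == 0, "speaker IDs should start from 0"

check_filelist("path/to/filelist_train.txt", n_fields=2)  # 3 for multi-speaker
```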
```sh
# Single speaker
python preprocess.py --text_index 1 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'korean_cleaners'

# Multiple speakers
python preprocess.py --text_index 2 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'korean_cleaners'
```
If your speech files are not mono / PCM-16, you should resample your .wav files first, for example with the sketch below.
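A minimal resampling sketch (assumes librosa and soundfile are installed; match target_sr to the "sampling_rate" in your config, 22050 here as an example):
```python
import librosa
import soundfile as sf

def to_mono_pcm16(in_path, out_path, target_sr=22050):
    # librosa.load downmixes to mono and resamples to target_sr.
    audio, _ = librosa.load(in_path, sr=target_sr, mono=True)
    # Write a 16-bit PCM WAV file.
    sf.write(out_path, audio, target_sr, subtype="PCM_16")

to_mono_pcm16("dataset/001.wav", "dataset/001_mono16.wav")
```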
Setting the json file in configs:

| Model | How to set up the json file in configs | Sample json configuration |
|---|---|---|
| iSTFT-VITS | `"istft_vits": true,`<br>`"upsample_rates": [8,8],` | ljs_istft_vits.json |
| MB-iSTFT-VITS | `"subbands": 4,`<br>`"mb_istft_vits": true,`<br>`"upsample_rates": [4,4],` | ljs_mb_istft_vits.json |
| MS-iSTFT-VITS | `"subbands": 4,`<br>`"ms_istft_vits": true,`<br>`"upsample_rates": [4,4],` | ljs_ms_istft_vits.json |
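To double-check which variant a config enables (a sketch; assumes the flags from the table sit under the "model" section, as in the sample configs):
```python
import json

with open("configs/ljs_mb_istft_vits.json") as f:  # example path
    model = json.load(f)["model"]

# Exactly one of the variant flags from the table should be true.
for flag in ("istft_vits", "mb_istft_vits", "ms_istft_vits"):
    if model.get(flag):
        print(flag, "| subbands:", model.get("subbands"),
              "| upsample_rates:", model["upsample_rates"])
```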
- If you have done preprocessing, set "cleaned_text" to true.
- Change `training_files` and `validation_files` to the paths of the preprocessed manifest files.
- Select the same `text_cleaners` you used in the preprocessing step. These fields can also be set programmatically, as in the sketch below.
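A sketch of editing these fields in code (paths and cleaner name are examples; the keys are assumed to sit under "data" as in stock VITS configs):
```python
import json

with open("configs/ljs_mb_istft_vits.json") as f:  # example path
    cfg = json.load(f)

data = cfg["data"]
data["cleaned_text"] = True                             # preprocessing is done
data["training_files"] = "filelists/train.txt.cleaned"  # example path
data["validation_files"] = "filelists/val.txt.cleaned"  # example path
data["text_cleaners"] = ["korean_cleaners"]             # same as preprocessing

with open("configs/ljs_mb_istft_vits.json", "w") as f:
    json.dump(cfg, f, indent=2)
```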
```sh
# Single speaker
python train.py -c <config> -m <folder>

# Multiple speakers
python train_ms.py -c <config> -m <folder>
```
Resuming training from the latest checkpoint is automatic.
After training, you can check the inference audio using inference.ipynb, or use inference_cpu.py:
```sh
python inference_cpu.py {model_name} {model_step}
```