MB-iSTFT-VITS-Korean

Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform with Korean Cleaners

Primary language: Python | License: Apache-2.0

MB-iSTFT-VITS

This is an implementation of MB-iSTFT-VITS.

  • Supported Language: Korean

Prerequisites

  • A Windows/Linux system with a minimum of 16GB RAM.
  • A GPU with at least 12GB of VRAM.
  • Python == 3.8
  • Anaconda installed.
  • PyTorch installed.
  • CUDA 11.x installed.
  • Zlib DLL installed.

PyTorch install command:

pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117

CUDA 11.7 install: https://developer.nvidia.com/cuda-11-7-0-download-archive

Zlib DLL install: https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#install-zlib-windows
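
Before installing, you can sanity-check the environment with a short stdlib-only script (a minimal sketch, not part of the repo; the Python version and module names checked are taken from the prerequisites above):

```python
import sys
from importlib import util

def check_env():
    """Return a list of setup issues found (an empty list means it looks OK)."""
    issues = []
    # The repo targets Python 3.8 specifically.
    if sys.version_info[:2] != (3, 8):
        issues.append(
            f"Python 3.8 expected, found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # Check that the PyTorch packages are installed, without importing them fully.
    for mod in ("torch", "torchaudio"):
        if util.find_spec(mod) is None:
            issues.append(f"{mod} is not installed")
    return issues

if __name__ == "__main__":
    for line in check_env() or ["environment looks OK"]:
        print(line)
```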


Installation

  1. Create an Anaconda environment:
conda create -n vits python=3.8
  2. Activate the environment:
conda activate vits
  3. Clone this repository to your local machine:
git clone https://github.com/ORI-Muchim/MB-iSTFT-VITS-Korean.git
  4. Navigate to the cloned directory:
cd MB-iSTFT-VITS-Korean
  5. Install the necessary dependencies:
pip install -r requirements.txt

Create transcript

Single speaker

Set "n_speakers" to 0 in config.json.

path/to/XXX.wav|transcript
  • Example
dataset/001.wav|안녕하세요.

Multiple speakers

Speaker IDs should start from 0.

path/to/XXX.wav|speaker id|transcript
  • Example
dataset/001.wav|0|안녕하세요.
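
Either format can be checked before preprocessing with a small script (a hypothetical helper, not part of the repo; it only validates the `path|[speaker id|]transcript` layout described above):

```python
def validate_filelist_line(line, multi_speaker=False):
    """Check one manifest line against the expected format; return an error string or None."""
    parts = line.rstrip("\n").split("|")
    expected = 3 if multi_speaker else 2
    if len(parts) != expected:
        return f"expected {expected} '|'-separated fields, got {len(parts)}"
    if not parts[0].endswith(".wav"):
        return f"first field should be a .wav path: {parts[0]!r}"
    if multi_speaker and not parts[1].isdigit():
        return f"speaker id must be a non-negative integer: {parts[1]!r}"
    if not parts[-1].strip():
        return "transcript is empty"
    return None

print(validate_filelist_line("dataset/001.wav|안녕하세요."))          # None (valid)
print(validate_filelist_line("dataset/001.wav|0|안녕하세요.", True))  # None (valid)
```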

Preprocess

# Single speaker
python preprocess.py --text_index 1 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'korean_cleaners'

# Multiple speakers
python preprocess.py --text_index 2 --filelists path/to/filelist_train.txt path/to/filelist_val.txt --text_cleaners 'korean_cleaners'

If your speech files are not mono, PCM-16 .wav files, you should resample them first.
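
You can verify a file's channel count and sample width with the stdlib `wave` module before running the preprocessor (a minimal check-only sketch; it does not convert anything):

```python
import wave

def needs_resampling(path):
    """Return the reasons a WAV file must be converted; empty list if it is mono PCM-16."""
    reasons = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            reasons.append(f"{wf.getnchannels()} channels (expected mono)")
        if wf.getsampwidth() != 2:  # 2 bytes per sample = 16-bit PCM
            reasons.append(f"{wf.getsampwidth() * 8}-bit samples (expected 16-bit PCM)")
    return reasons

# Demo: write a short stereo 16-bit file and confirm it is flagged.
with wave.open("demo.wav", "wb") as wf:
    wf.setnchannels(2)
    wf.setsampwidth(2)
    wf.setframerate(22050)
    wf.writeframes(b"\x00\x00" * 4)  # 2 frames of silence, 2 channels
print(needs_resampling("demo.wav"))  # ['2 channels (expected mono)']
```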

Setting json file in configs

Model         | How to set up json file in configs                            | Sample json file configuration
--------------|---------------------------------------------------------------|-------------------------------
iSTFT-VITS    | "istft_vits": true, "upsample_rates": [8,8]                   | ljs_istft_vits.json
MB-iSTFT-VITS | "subbands": 4, "mb_istft_vits": true, "upsample_rates": [4,4] | ljs_mb_istft_vits.json
MS-iSTFT-VITS | "subbands": 4, "ms_istft_vits": true, "upsample_rates": [4,4] | ljs_ms_istft_vits.json
  • If you have done preprocessing, set "cleaned_text" to true.
  • Change training_files and validation_files to the path of preprocessed manifest files.
  • Select the same text_cleaners you used in the preprocessing step.
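
The bullet points above can be applied programmatically with the stdlib json module (a sketch; the nested key names follow the standard VITS config layout, and the ".cleaned" suffix assumes the preprocessor's default output extension):

```python
import json

# Start from a minimal stand-in for the "data" section of a config file.
config = {"data": {"training_files": "", "validation_files": "",
                   "text_cleaners": [], "cleaned_text": False}}

config["data"]["training_files"] = "filelists/filelist_train.txt.cleaned"
config["data"]["validation_files"] = "filelists/filelist_val.txt.cleaned"
config["data"]["text_cleaners"] = ["korean_cleaners"]  # same cleaners as preprocessing
config["data"]["cleaned_text"] = True  # preprocessing was already run

print(json.dumps(config["data"], indent=2))
```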

Training

# Single speaker
python train.py -c <config> -m <folder>

# Multiple speakers
python train_ms.py -c <config> -m <folder>

Resuming training from the latest checkpoint is automatic.

After training, you can check the inference audio using inference.ipynb.

Alternatively, run inference_cpu.py:

python inference_cpu.py {model_name} {model_step}

References