RVC-VITS-webUI

Few-shot multilingual TTS with RVC and VITS + WebUI

A tool that lets you train a natural-sounding multilingual speech synthesis model from a small dataset (5-10 minutes) using RVC. Most speech synthesis models require vast amounts of data, but a large dataset is not always available. This repository started with the idea: "Then why don't we clone a dataset and use that?"

0. Process

  1. Train RVC with the small dataset
  2. Clone a dataset with the trained RVC model (see the sketch after this list)
  3. Train VITS on the cloned dataset
  4. Inference
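To make step 2 concrete, here is a minimal sketch of the cloning idea: take a larger public corpus in the target language, pass each utterance through the trained RVC model so it takes on the few-shot speaker's voice, and reuse the original transcripts for VITS training. The convert_with_rvc hook and the directory names below are hypothetical placeholders, not part of this repo; in practice this step is handled by make_dataset.sh.

# Sketch of step 2 (dataset cloning). convert_with_rvc is a hypothetical
# placeholder for the voice-conversion call made with the trained RVC model.
from pathlib import Path
import shutil

def convert_with_rvc(src_wav: Path, dst_wav: Path) -> None:
    # Placeholder: simply copies the file. Real code would run RVC conversion
    # so the audio takes on the few-shot target speaker's voice.
    shutil.copy(src_wav, dst_wav)

base_corpus = Path("base_corpus")      # large public corpus (e.g. LJSpeech or JSUT)
cloned = Path("cloned_dataset")        # cloned audio + reused transcripts for VITS
cloned.mkdir(exist_ok=True)

for wav in sorted(base_corpus.glob("*.wav")):
    convert_with_rvc(wav, cloned / wav.name)
    txt = wav.with_suffix(".txt")
    if txt.exists():
        shutil.copy(txt, cloned / txt.name)   # transcripts are reused unchanged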

1. Pre-requisites

  1. Python >= 3.8
  2. Download RVC-VITS.zip and unzip RVC-VITS.zip
  3. Install the Python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
  4. Install the packages listed in requirements.txt and torch by running:
./set_env.sh
  5. Put the dataset into the rvc_dataset directory according to the following file structure. In this experiment, 50 wav files from the LJSpeech dataset (330 seconds in total) were used.
rvc_dataset
├───ljs
│   ├───LJ001-0001.wav
│   ├───LJ001-0002.wav
│   ├───...
│   └───LJ001-0050.wav
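Before training, it can help to confirm how much audio is actually in the directory. The snippet below is a quick sanity check, not part of the repo; it assumes PCM WAV files readable by the standard-library wave module and the rvc_dataset/ljs path shown above.

# Sanity check: total duration of the wavs placed in rvc_dataset/ljs.
import wave
from pathlib import Path

dataset_dir = Path("rvc_dataset/ljs")          # path from the structure above
wav_files = sorted(dataset_dir.glob("*.wav"))

total_sec = 0.0
for wav_path in wav_files:
    with wave.open(str(wav_path), "rb") as wf:
        total_sec += wf.getnframes() / wf.getframerate()

print(f"{len(wav_files)} files, {total_sec:.0f} seconds ({total_sec / 60:.1f} minutes) of audio")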

2. Training

./train_rvc.sh ljs 500
# If you want to train Korean TTS, change ja to ko (ja -> Japanese, ko -> Korean, en -> English)
./make_dataset.sh ljs ja
./train_vits.sh ljs 
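If you prefer driving the three steps from Python (for example to queue several speakers), the wrapper below simply shells out to the scripts above with the same arguments; run_pipeline itself is not part of the repo.

# Minimal wrapper around the repo's own scripts. Run from the repository root.
import subprocess

def run_pipeline(name: str, epochs: int = 500, lang: str = "ja") -> None:
    # 1. train RVC, 2. clone a dataset in the chosen language, 3. train VITS
    steps = [
        ["./train_rvc.sh", name, str(epochs)],
        ["./make_dataset.sh", name, lang],
        ["./train_vits.sh", name],
    ]
    for cmd in steps:
        subprocess.run(cmd, check=True)   # stop immediately if a step fails

run_pipeline("ljs", epochs=500, lang="ja")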

3. Inference

See vits/inference.ipynb
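The notebook is the authoritative reference. For orientation only, inference typically follows the upstream VITS flow sketched below; the config path, checkpoint path, and exact module layout are assumptions based on the upstream VITS repository, not confirmed against this repo.

# Rough shape of upstream-style VITS inference; adapt paths and imports to
# whatever vits/inference.ipynb actually uses in this repo.
import torch
import commons, utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/ljs.json")        # assumed config path
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model)
net_g.eval()
utils.load_checkpoint("logs/ljs/G_latest.pth", net_g, None)  # assumed checkpoint path

text_norm = text_to_sequence("Hello world.", hps.data.text_cleaners)
if hps.data.add_blank:
    text_norm = commons.intersperse(text_norm, 0)
x = torch.LongTensor(text_norm).unsqueeze(0)
x_lengths = torch.LongTensor([x.size(1)])

with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667,
                        noise_scale_w=0.8, length_scale=1.0)[0][0, 0].cpu().numpy()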

3.5 Inference Voice Sample

See ljs_ja_voice

Test Datasets

Language    Name        Link
Korean      KSS         https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset
Japanese    JSUT        https://sites.google.com/site/shinnosuketakamichi/publication/jsut
English     LJSPEECH    https://keithito.com/LJ-Speech-Dataset/
