RVC-VITS-webUI

Few-shot multilingual TTS with RVC and VITS + WebUI

A tool that lets you train a natural-sounding multilingual speech synthesis model from a small dataset (5-10 minutes) using RVC. Most speech synthesis models require vast amounts of data, but a large dataset is not always available. This repository started with the idea: "Then why don't we clone a dataset and use that?"

0. Process

  1. Train RVC with the small dataset
  2. Clone a dataset with the trained RVC model (see the sketch after this list)
  3. Train VITS on the cloned dataset
  4. Inference
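To make step 2 concrete, here is a minimal sketch of the cloning idea: take a larger public corpus in the target language, pass each utterance through the trained RVC model so it takes on the few-shot speaker's voice, and reuse the original transcripts for VITS training. The convert_with_rvc hook and the directory names below are hypothetical placeholders, not part of this repo; in practice this step is handled by make_dataset.sh.

# Sketch of step 2 (dataset cloning). convert_with_rvc is a hypothetical
# placeholder for the voice-conversion call made with the trained RVC model.
from pathlib import Path
import shutil

def convert_with_rvc(src_wav: Path, dst_wav: Path) -> None:
    # Placeholder: simply copies the file. Real code would run RVC conversion
    # so the audio takes on the few-shot target speaker's voice.
    shutil.copy(src_wav, dst_wav)

base_corpus = Path("base_corpus")      # large public corpus (e.g. LJSpeech or JSUT)
cloned = Path("cloned_dataset")        # cloned audio + reused transcripts for VITS
cloned.mkdir(exist_ok=True)

for wav in sorted(base_corpus.glob("*.wav")):
    convert_with_rvc(wav, cloned / wav.name)
    txt = wav.with_suffix(".txt")
    if txt.exists():
        shutil.copy(txt, cloned / txt.name)   # transcripts are reused unchanged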

1. Pre-requisites

  1. Python >= 3.8
  2. Download RVC-VITS.zip and unzip RVC-VITS.zip
  3. Install the Python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
  4. Install the packages listed in requirements.txt and torch by running:
./set_env.sh
  5. Put the dataset into the rvc_dataset directory according to the following file structure. In this experiment, 50 wav files from the LJSpeech dataset (330 seconds in total) were used.
rvc_dataset
├───ljs
│   ├───LJ001-0001.wav
│   ├───LJ001-0002.wav
│   ├───...
│   └───LJ001-0050.wav
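Before training, it can help to confirm how much audio is actually in the directory. The snippet below is a quick sanity check, not part of the repo; it assumes PCM WAV files readable by the standard-library wave module and the rvc_dataset/ljs path shown above.

# Sanity check: total duration of the wavs placed in rvc_dataset/ljs.
import wave
from pathlib import Path

dataset_dir = Path("rvc_dataset/ljs")          # path from the structure above
wav_files = sorted(dataset_dir.glob("*.wav"))

total_sec = 0.0
for wav_path in wav_files:
    with wave.open(str(wav_path), "rb") as wf:
        total_sec += wf.getnframes() / wf.getframerate()

print(f"{len(wav_files)} files, {total_sec:.0f} seconds ({total_sec / 60:.1f} minutes) of audio")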

2. Training

./train_rvc.sh ljs 500
# If you want to train Korean TTS, change ja to ko (ja -> Japanese, ko -> Korean, en -> English)
./make_dataset.sh ljs ja
./train_vits.sh ljs 
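If you prefer driving the three steps from Python (for example to queue several speakers), the wrapper below simply shells out to the scripts above with the same arguments; run_pipeline itself is not part of the repo.

# Minimal wrapper around the repo's own scripts. Run from the repository root.
import subprocess

def run_pipeline(name: str, epochs: int = 500, lang: str = "ja") -> None:
    # 1. train RVC, 2. clone a dataset in the chosen language, 3. train VITS
    steps = [
        ["./train_rvc.sh", name, str(epochs)],
        ["./make_dataset.sh", name, lang],
        ["./train_vits.sh", name],
    ]
    for cmd in steps:
        subprocess.run(cmd, check=True)   # stop immediately if a step fails

run_pipeline("ljs", epochs=500, lang="ja")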

3. Inference

See vits/inference.ipynb
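The notebook is the authoritative reference. For orientation only, inference typically follows the upstream VITS flow sketched below; the config path, checkpoint path, and exact module layout are assumptions based on the upstream VITS repository, not confirmed against this repo.

# Rough shape of upstream-style VITS inference; adapt paths and imports to
# whatever vits/inference.ipynb actually uses in this repo.
import torch
import commons, utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/ljs.json")        # assumed config path
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model)
net_g.eval()
utils.load_checkpoint("logs/ljs/G_latest.pth", net_g, None)  # assumed checkpoint path

text_norm = text_to_sequence("Hello world.", hps.data.text_cleaners)
if hps.data.add_blank:
    text_norm = commons.intersperse(text_norm, 0)
x = torch.LongTensor(text_norm).unsqueeze(0)
x_lengths = torch.LongTensor([x.size(1)])

with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667,
                        noise_scale_w=0.8, length_scale=1.0)[0][0, 0].cpu().numpy()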

3.5 Inference Voice Sample

See ljs_ja_voice

Test Datasets

Language    Name        Link
Korean      KSS         https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset
Japanese    JSUT        https://sites.google.com/site/shinnosuketakamichi/publication/jsut
English     LJSPEECH    https://keithito.com/LJ-Speech-Dataset/
