Variational Inference with adversarial learning for end-to-end Singing Voice Conversion based on VITS

This project is aimed at deep learning beginners; basic familiarity with Python and PyTorch is a prerequisite.
This project aims to help deep learning beginners move beyond dry, purely theoretical study and master the fundamentals of deep learning through practice.
This project does not support real-time voice conversion (supporting it would require replacing whisper).
This project will not develop one-click packages for other purposes.

A GPU with 6 GB of memory is enough for training
Supports multiple speakers
Creates unique speakers through speaker mixing
Can convert audio even with light accompaniment
F0 can be edited using Excel

Model properties

Feature	From	Status	Function
whisper	OpenAI	✅	strong noise immunity
bigvgan	NVIDIA	✅	anti-aliasing and snake activation
natural speech	Microsoft	✅	reduce mispronunciation
neural source-filter	NII	✅	solve the problem of audio F0 discontinuity
speaker encoder	Google	✅	timbre encoding and clustering
GRL for speaker	Ubisoft	✅	prevent the encoder from leaking timbre
one shot vits	Samsung	✅	voice cloning
SCLN	Microsoft	✅	improve cloning
PPG perturbation	this project	✅	improve noise immunity and timbre removal
HuBERT perturbation	this project	✅	improve noise immunity and timbre removal
VAE perturbation	this project	✅	improve sound quality

Because data perturbation is used, training takes longer than in other projects.

Dataset preparation

Necessary pre-processing:

1 separate the vocals from the accompaniment, e.g. with UVR
2 cut the audio into clips shorter than 30 seconds (required by whisper), e.g. with slicer

Then put the dataset into the dataset_raw directory using the following file structure:

dataset_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
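
Before preprocessing, it can help to confirm that the layout and the 30-second limit above actually hold. The following is a minimal sketch, not part of the project: the function names are hypothetical, and it assumes the clips are PCM WAV files, which is the only kind Python's standard `wave` module can read.

```python
import wave
from pathlib import Path

MAX_SECONDS = 30.0  # clip-length limit stated above for whisper


def wav_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_dataset(root):
    """Return a list of problems found under root/<speaker>/<clip>.wav."""
    problems = []
    for speaker_dir in sorted(Path(root).iterdir()):
        if not speaker_dir.is_dir():
            problems.append(f"{speaker_dir.name}: expected a speaker directory")
            continue
        clips = sorted(speaker_dir.glob("*.wav"))
        if not clips:
            problems.append(f"{speaker_dir.name}: no .wav files")
        for clip in clips:
            seconds = wav_seconds(clip)
            if seconds >= MAX_SECONDS:
                problems.append(
                    f"{speaker_dir.name}/{clip.name}: {seconds:.1f} s is too long"
                )
    return problems
```

Calling `check_dataset("dataset_raw")` returns an empty list when every speaker directory contains only clips shorter than 30 seconds.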

Install dependencies

1 software dependency

sudo apt update && sudo apt install ffmpeg

pip install -r requirements.txt
2 download the timbre encoder Speaker-Encoder by @mueller91 and put best_model.pth.tar into speaker_pretrain/
3 download the whisper model whisper-large-v2 (make sure to download large-v2.pt) and put it into whisper_pretrain/
4 whisper is built in; do not install it separately, or it will conflict and raise an error
5 download the hubert_soft model and put hubert-soft-0d54a1f4.pt into hubert_pretrain/

Data preprocessing

1, resample

generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k

python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000

generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k

python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
2, use the 16 kHz audio to extract pitch

python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch
3, use the 16 kHz audio to extract ppg

python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper
4, use the 16 kHz audio to extract hubert

python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert
5, use the 16 kHz audio to extract the timbre code

python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker
6, extract the average timbre code of each speaker for inference; when the training index is generated, it can also replace the per-clip timbre codes and serve as the speaker's unified timbre for training

python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer
7, use the 32 kHz audio to extract the linear spectrum

python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
8, use the 32 kHz audio to generate the training index

python prepare/preprocess_train.py
9, debug the training files

python prepare/preprocess_zzz.py
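
Step 6 above reduces the per-clip timbre codes of a speaker to one average code. Conceptually this is an element-wise mean, sketched below under stated assumptions: the function name is hypothetical, and plain Python lists are used so the example runs without dependencies. The real codes are stored as `.npy` arrays, for which the NumPy equivalent would be `np.mean(np.stack(codes), axis=0)`. This illustrates the idea and is not the project's own script.

```python
def average_timbre(codes):
    """Element-wise mean of equal-length timbre vectors."""
    if not codes:
        raise ValueError("need at least one timbre code")
    dim = len(codes[0])
    if any(len(code) != dim for code in codes):
        raise ValueError("all timbre codes must have the same length")
    # zip(*codes) groups the i-th component of every vector together.
    return [sum(values) / len(codes) for values in zip(*codes)]
```

For example, `average_timbre([[1.0, 2.0], [3.0, 6.0]])` yields one vector with the per-dimension means of the two inputs.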

data_svc/
├── waves-16k
│    ├── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
├── waves-32k
│    ├── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
├── pitch
│    ├── speaker0
│    │      ├── 000001.pit.npy
│    │      └── 000xxx.pit.npy
│    └── speaker1
│           ├── 000001.pit.npy
│           └── 000xxx.pit.npy
├── hubert
│    ├── speaker0
│    │      ├── 000001.vec.npy
│    │      └── 000xxx.vec.npy
│    └── speaker1
│           ├── 000001.vec.npy
│           └── 000xxx.vec.npy
├── whisper
│    ├── speaker0
│    │      ├── 000001.ppg.npy
│    │      └── 000xxx.ppg.npy
│    └── speaker1
│           ├── 000001.ppg.npy
│           └── 000xxx.ppg.npy
├── speaker
│    ├── speaker0
│    │      ├── 000001.spk.npy
│    │      └── 000xxx.spk.npy
│    └── speaker1
│           ├── 000001.spk.npy
│           └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
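
If a preprocessing step was skipped or interrupted, some clips will lack feature files. The sketch below is a hypothetical helper, not part of the project. It lists every feature file that the tree above implies for each 16 kHz clip but that is not on disk. It checks only the directories and suffixes shown in the tree, so the output of the linear-spectrum step is not covered.

```python
from pathlib import Path

# Feature directory -> file suffix, as shown in the tree above.
FEATURES = {
    "pitch": ".pit.npy",
    "hubert": ".vec.npy",
    "whisper": ".ppg.npy",
    "speaker": ".spk.npy",
}


def missing_features(root="data_svc"):
    """Return the expected feature files (relative to root) that are absent."""
    root = Path(root)
    missing = []
    for wav in sorted((root / "waves-16k").glob("*/*.wav")):
        speaker, stem = wav.parent.name, wav.stem
        for folder, suffix in FEATURES.items():
            expected = root / folder / speaker / (stem + suffix)
            if not expected.is_file():
                missing.append(expected.relative_to(root).as_posix())
    return missing
```

An empty result from `missing_features()` means every clip under `waves-16k` has its pitch, hubert, whisper, and speaker files.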

Train

1, to fine-tune from the pre-trained model, first download it: sovits5.0_bigvgan_mix_v2.pth

set pretrain: "./sovits5.0_bigvgan_mix_v2.pth" in configs/base.yaml, and adjust the learning rate appropriately, e.g. 5e-5
2, start training

python svc_trainer.py -c configs/base.yaml -n sovits5.0
3, resume training

python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/***.pth
4, view the log

tensorboard --logdir logs/
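
The resume command above needs the path of a concrete checkpoint. One possible way to pick it is sketched below. The helper names are hypothetical, and it assumes that the most recently modified `.pth` file in `chkpt/sovits5.0/` is the one to resume from, which the project does not state.

```python
from pathlib import Path


def latest_checkpoint(folder="chkpt/sovits5.0"):
    """Return the most recently modified .pth file in folder, or None."""
    checkpoints = list(Path(folder).glob("*.pth"))
    if not checkpoints:
        return None
    return max(checkpoints, key=lambda path: path.stat().st_mtime)


def resume_command(checkpoint, config="configs/base.yaml", name="sovits5.0"):
    """Build the argument list for the resume command shown above."""
    return ["python", "svc_trainer.py", "-c", config, "-n", name,
            "-p", str(checkpoint)]
```

The resulting list can be passed to `subprocess.run` once a checkpoint has been found.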

Inference

1, export the inference model: text encoder, flow network, decoder network

python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
2, use whisper to extract the content encoding; this is done as a separate step rather than in one-step inference in order to reduce GPU memory usage

python whisper/inference.py -w test.wav -p test.ppg.npy
3, use hubert to extract the content vector; this is done as a separate step rather than in one-step inference in order to reduce GPU memory usage

python hubert/inference.py -w test.wav -v test.vec.npy
4, extract the F0 parameter to a csv text file, open the csv file in Excel, and manually correct wrong F0 values with reference to Audition or SonicVisualiser

python pitch/inference.py -w test.wav -p test.csv
5, specify parameters and run inference

python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./configs/singers/singer0001.npy --wave test.wav --ppg test.ppg.npy --vec test.vec.npy --pit test.csv

when --ppg is specified, the content encoding is not extracted again when the same audio is inferred multiple times; if it is not specified, it is extracted automatically;

when --vec is specified, the content vector is not extracted again when the same audio is inferred multiple times; if it is not specified, it is extracted automatically;

when --pit is specified, the manually tuned F0 parameter is loaded; if it is not specified, F0 is extracted automatically;

the output file is generated in the current directory: svc_out.wav
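
Because `--ppg`, `--vec`, and `--pit` are optional, a small wrapper can pass each one only when the corresponding file already exists. The sketch below uses a hypothetical helper name. It assumes the cached files sit next to the input and follow the naming used in the steps above (`test.wav` with `test.ppg.npy`, `test.vec.npy`, and `test.csv`). It only builds the argument list and does not run anything.

```python
from pathlib import Path


def inference_command(wave, model="sovits5.0.pth",
                      spk="./configs/singers/singer0001.npy",
                      config="configs/base.yaml"):
    """Build the svc_inference.py argument list for one input file."""
    wave = Path(wave)
    cmd = ["python", "svc_inference.py", "--config", config,
           "--model", model, "--spk", spk, "--wave", str(wave)]
    # Add each optional flag only if its cached file is already on disk.
    for flag, suffix in (("--ppg", ".ppg.npy"),
                         ("--vec", ".vec.npy"),
                         ("--pit", ".csv")):
        cached = wave.with_suffix(suffix)
        if cached.is_file():
            cmd += [flag, str(cached)]
    return cmd
```

Flags whose files are missing are simply left out, so the script falls back to automatic extraction as described above.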

args	--config	--model	--spk	--wave	--ppg	--vec	--pit	--shift
name	config path	model path	speaker	wave input	wave ppg	wave hubert	wave pitch	pitch shift
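
As an alternative to editing F0 values by hand, a whole F0 track can be transposed in code. The sketch below rests on assumptions the source does not confirm: that the csv file holds one `time,f0` pair per line without a header, that unvoiced frames are stored as 0, and that a pitch shift is measured in semitones, so that shifting by n semitones multiplies F0 by 2^(n/12). The function names are hypothetical.

```python
import csv
import io


def read_f0(text):
    """Parse 'time,f0' lines (assumed layout) into pairs of floats."""
    return [(float(t), float(f0)) for t, f0 in csv.reader(io.StringIO(text))]


def shift_f0(rows, semitones):
    """Scale voiced F0 values by 2 ** (semitones / 12); keep 0 (unvoiced)."""
    factor = 2.0 ** (semitones / 12.0)
    return [(t, f0 * factor if f0 > 0 else f0) for t, f0 in rows]
```

Under these assumptions a shift of 12 semitones doubles every voiced F0 value, which corresponds to one octave up.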

Create singer

named by pure coincidence: average -> ave -> eva; eve (eva) represents conception and reproduction

python svc_eva.py

eva_conf = {
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}

the generated singer file is: eva.spk.npy
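
The weights in `eva_conf` suggest that the new singer is a weighted combination of existing timbre codes. The sketch below illustrates that idea as a plain weighted sum, which is an assumption and not confirmed by the source. The function name is hypothetical, and plain Python lists stand in for the `.npy` arrays so the example runs without dependencies.

```python
def mix_speakers(weighted_codes):
    """Weighted sum of equal-length timbre vectors.

    weighted_codes is a list of (vector, weight) pairs.
    """
    if not weighted_codes:
        raise ValueError("need at least one (vector, weight) pair")
    dim = len(weighted_codes[0][0])
    mixed = [0.0] * dim
    for vector, weight in weighted_codes:
        if len(vector) != dim:
            raise ValueError("all timbre codes must have the same length")
        for i, value in enumerate(vector):
            mixed[i] += weight * value
    return mixed
```

With the weights shown in `eva_conf`, the two singers weighted 0 would contribute nothing and the other two would contribute equally.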

Datasets

Name	URL
KiSing	http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/
PopCS	https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md
opencpop	https://wenet.org.cn/opencpop/download/
Multi-Singer	https://github.com/Multi-Singer/Multi-Singer.github.io
M4Singer	https://github.com/M4Singer/M4Singer/blob/master/apply_form.md
CSD	https://zenodo.org/record/4785016#.YxqrTbaOMU4
KSS	https://www.kaggle.com/datasets/bryanpark/korean-single-speaker-speech-dataset
JVS MuSic	https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_music
PJS	https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus
JSUT Song	https://sites.google.com/site/shinnosuketakamichi/publication/jsut-song
MUSDB18	https://sigsep.github.io/datasets/musdb.html#musdb18-compressed-stems
DSD100	https://sigsep.github.io/datasets/dsd100.html
Aishell-3	http://www.aishelltech.com/aishell_3
VCTK	https://datashare.ed.ac.uk/handle/10283/2651