Clone of DDSP-SVC, singing voice conversion based on DDSP.
Samples or Link to demo page.
Features compared to Diff-SVC and SO-VITS-SVC:
- light training
- fast inference
- equivalent or lower quality
- sub-optimal quality w/o enhancer
- optimal quality w/ enhancer (close to SO-VITS-SVC, not at the level of Diff-SVC)
# pip install "torch==1.11.0" -q # Based on your environment (validated with 1.9.1)
# pip install "torchaudio==0.11.0" -q # Based on your environment (validated with 0.6.0)
# pip install git+https://github.com/tarepan/DDSP-SVC-official
git clone https://github.com/tarepan/DDSP-SVC-official
pip install -r requirements.txt
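Before moving on, it can help to confirm the environment; the following is just an optional sanity-check sketch, not part of the repository:

```python
# Optional sanity check of the environment (not part of the repository).
import torch
import torchaudio

print("torch      :", torch.__version__)
print("torchaudio :", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```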
Then, configure the pretrained models:
- (Required) Prepare an encoder (choose one):
  - TypeA: Download the pretrained HubertSoft encoder and put it under the `pretrain/hubert` folder.
  - TypeB: Download the pretrained ContentVec encoder.
- Get the pretrained vocoder-based enhancer from the DiffSinger Community Vocoders Project and unzip it into the `pretrain/` folder.
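As a quick way to confirm those assets are in place, a minimal sketch like the following only checks that the folders named above exist and are non-empty (the exact file names depend on which encoder and vocoder you downloaded):

```python
# Minimal sketch: verify that the pretrained-model folders named above exist.
# Exact file names depend on the downloaded encoder/vocoder, so only the
# directories mentioned in this README are checked.
from pathlib import Path

expected_dirs = [
    Path("pretrain"),         # enhancer (DiffSinger Community Vocoder) is unzipped here
    Path("pretrain/hubert"),  # TypeA: HubertSoft encoder
]

for d in expected_dirs:
    ok = d.is_dir() and any(d.iterdir())
    print(f"{d}: {'OK' if ok else 'MISSING or EMPTY'}")
```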
First, place audio files under a directory structured like below:

```
data/
├── val/
│   └── audio/
│       ├── 1/
│       └── 2/
└── train/
    └── audio/
        ├── 1/
        └── 2/
            ├── ccc.wav
            └── ddd.wav
```
Speaker folder names should be positive integers not greater than `n_spk`, representing the speaker IDs.
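Before preprocessing, a small hedged check like the one below can confirm that layout; `N_SPK` and the `data/` root are assumptions taken from the description above, not part of the repository:

```python
# Hypothetical helper: verify that speaker folders follow the rule above
# (positive integer names not greater than n_spk) and contain .wav files.
from pathlib import Path

N_SPK = 2  # assumption: set this to the n_spk value of your config

for split in ("train", "val"):
    audio_root = Path("data") / split / "audio"
    for spk_dir in sorted(audio_root.iterdir()):
        assert spk_dir.is_dir(), f"unexpected file: {spk_dir}"
        spk_id = int(spk_dir.name)  # folder name must parse as an integer
        assert 1 <= spk_id <= N_SPK, f"speaker id {spk_id} out of range"
        wavs = list(spk_dir.glob("*.wav"))
        print(f"{split} / speaker {spk_id}: {len(wavs)} wav files")
```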
You can also run `python draw.py` to help select validation data (adjust the parameters in `draw.py` to change the number of extracted files, etc.).
Then run:

```bash
python preprocess.py -c <config>  # e.g. configs/combsub.yaml
```
- Configs
  - sampling rate: 44.1 kHz
  - commons: optimized for NVIDIA GTX 1660
- Restrictions (a dataset check sketch follows this list)
  - assert `audio_sr == config_sr`; otherwise training becomes very slow because of on-the-fly resampling
  - assert `2sec <= len(audio) <= not_too_long`
  - assert `n_clip ~ 1000 if cache_all_data is True`, because of the on-memory cache size
  - assert `n_val_clip <= 10`, because of the validation cost
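For the sample-rate and minimum-length restrictions above, a hedged check like the following can scan the dataset before preprocessing; `CONFIG_SR` and the folder layout are assumptions based on this README, not code shipped with the repo:

```python
# Hedged sketch: check training clips against the restrictions listed above.
# CONFIG_SR and the data/ layout are assumptions taken from this README.
from pathlib import Path

import torchaudio

CONFIG_SR = 44100  # assumption: the sampling rate of your chosen config
MIN_SEC = 2.0      # clips shorter than about 2 seconds are not allowed

n_clips = 0
for wav in sorted(Path("data/train/audio").rglob("*.wav")):
    waveform, sr = torchaudio.load(str(wav))
    assert sr == CONFIG_SR, f"{wav}: {sr} Hz != {CONFIG_SR} Hz (training would resample and slow down)"
    duration = waveform.shape[-1] / sr
    assert duration >= MIN_SEC, f"{wav}: only {duration:.2f} s long"
    n_clips += 1

print(f"{n_clips} training clips OK (keep around ~1000 when cache_all_data is True)")
```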
Jump to ☞, then Run. That's all!
For arguments, check `./ddspsvc/config.py`.
For dataset, check `speechcorpusy`.
```bash
python train.py -c <config>  # e.g. configs/combsub.yaml
```
The test audio samples in TensorBoard are the outputs of DDSP-SVC w/o enhancer.
You can safely interrupt training; running the same command line afterwards will resume it.
You can also finetune the model: interrupt training first, then re-preprocess on a new dataset or change the training parameters (batch size, lr, etc.), and run the same command line again.
Both CLI and Python are supported.
For details, jump to ☞ and check it.
A pretrained model is provided in the GitHub release.
With the provided model or your own trained model, run:
```bash
# Pure DDSP
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -e false

# DDSP + enhancer
## For a normal vocal range, set `enhancer_adaptive_key` to 0; otherwise set it to >0
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -id <speaker_id> -eak <enhancer_adaptive_key (semitones)>

# Show all options
python main.py -h
```
UPDATE: Mix-speaker mode is now supported. You can use the `-mix` option to design your own vocal timbre; below is an example:

```bash
# Mix the timbres of the 1st and 2nd speakers in a 0.5 : 0.5 ratio
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)> -mix "{1:0.5, 2:0.5}" -eak 0
```
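If you need to convert many files, one hedged option is to drive the CLI above from Python; the flags are the ones shown in this README, while the model path, folders, and parameter values below are placeholders:

```python
# Sketch: batch conversion by driving the main.py CLI shown above.
# The model path, folders, key change, and speaker id here are placeholders.
import subprocess
from pathlib import Path

MODEL = "exp/my_model.pt"  # placeholder: your trained or downloaded model file
KEY_CHANGE = "0"           # semitones
SPEAKER_ID = "1"

out_dir = Path("outputs")  # placeholder output folder
out_dir.mkdir(parents=True, exist_ok=True)

for wav in sorted(Path("inputs").glob("*.wav")):  # placeholder input folder
    subprocess.run(
        ["python", "main.py",
         "-i", str(wav), "-m", MODEL, "-o", str(out_dir / wav.name),
         "-k", KEY_CHANGE, "-id", SPEAKER_ID, "-eak", "0"],
        check=True,
    )
```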
Start a simple GUI with the following command:

```bash
python gui.py
```
The front-end uses techniques such as sliding windows, cross-fading, SOLA-based splicing, and contextual semantic reference, which achieve sound quality close to non-real-time synthesis with low latency and low resource usage.

Update: A splicing algorithm based on a phase vocoder has been added, but in most cases SOLA-based splicing already gives high enough quality, so it is turned off by default. If you are pursuing extremely low-latency real-time sound quality, you can consider turning it on and tuning the parameters carefully; the sound quality may then be higher. However, extensive testing has found that if the cross-fade time is longer than 0.1 seconds, the phase vocoder causes a significant degradation in sound quality.
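As an illustration of the cross-fading step mentioned above (a simplified sketch, not the repository's actual real-time code, and without SOLA's offset search), overlapping chunks can be blended like this:

```python
# Simplified illustration of cross-fade splicing between consecutive audio chunks.
# Not the repo's real-time implementation: SOLA additionally searches for the best
# overlap offset before fading, which is omitted here.
import numpy as np

def crossfade_splice(prev_chunk: np.ndarray, next_chunk: np.ndarray, overlap: int) -> np.ndarray:
    """Concatenate two chunks, linearly cross-fading over `overlap` samples."""
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    blended = prev_chunk[-overlap:] * fade_out + next_chunk[:overlap] * fade_in
    return np.concatenate([prev_chunk[:-overlap], blended, next_chunk[overlap:]])

# Example: splice two 440 Hz dummy chunks with a 512-sample overlap at 44.1 kHz.
sr, overlap = 44100, 512
t = np.arange(sr) / sr
chunk_a = np.sin(2 * np.pi * 440 * t)
chunk_b = np.sin(2 * np.pi * 440 * t)
spliced = crossfade_splice(chunk_a, chunk_b, overlap)
print(spliced.shape)  # (2 * sr - overlap,)
```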
- training
  - x.x [iter/sec] @ NVIDIA X0 on Google Colaboratory (AMP+)
  - takes about y days for the whole training
- inference
  - z.z [sec/sample] @ xx