Yu Zhang, Rongjie Huang, Ruiqi Li, JinZheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao | Zhejiang University, Huawei Cloud
PyTorch Implementation of StyleSinger (AAAI 2024): Style Transfer for Out-of-Domain Singing Voice Synthesis.
We provide our implementation and pre-trained models in this repository.
Visit our demo page for audio samples.
You can use the pre-trained models we provide here. Details of each folder are as follows:
| Model | Description |
| --- | --- |
| StyleSinger | Acoustic model (config) |
| HIFI-GAN | Neural vocoder |
| Encoder | Emotion encoder |
A suitable conda environment named `stylesinger` can be created and activated with:
```bash
conda create -n stylesinger python=3.8
conda activate stylesinger
conda install --yes --file requirements.txt
```
By default, this implementation uses as many GPUs in parallel as returned by `torch.cuda.device_count()`. You can specify which GPUs to use by setting the `CUDA_VISIBLE_DEVICES` environment variable before running the training module.
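For example, to make only the first two GPUs visible to PyTorch (this is standard CUDA behavior, not something specific to this repository):

```bash
# Expose only GPUs 0 and 1; torch.cuda.device_count() will then return 2.
export CUDA_VISIBLE_DEVICES=0,1
```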
Here we provide a singing voice synthesis pipeline using StyleSinger.
- Prepare StyleSinger (acoustic model): download the checkpoint and put it at `checkpoints/StyleSinger`.
- Prepare HIFI-GAN (neural vocoder): download the checkpoint and put it at `checkpoints/hifigan`.
- Prepare the emotion encoder: download the checkpoint and put it at `checkpoints/global.pt`.
- Prepare the dataset: download the statistical files and put them at `data/binary/test_set`.
- Prepare the reference information: provide a reference audio (48 kHz) and, for each target phoneme (ph), the target note, note duration, and note type (rest: 1, lyric: 2, slur: 3), together with the reference audio path. Enter this information in `inference/StyleSinger.py` (a sketch of these inputs follows this list).
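As a rough sketch, the reference inputs described above could be organized as follows; the variable names and all values here are hypothetical placeholders, so check `inference/StyleSinger.py` for the actual interface it expects:

```python
# Hypothetical per-phoneme reference inputs (all names and values illustrative).
inp = {
    "ph": ["AP", "s", "en", "en", "AP"],      # target phonemes
    "note": [0, 62, 62, 64, 0],               # target MIDI note for each ph (0 = rest)
    "note_dur": [0.3, 0.4, 0.4, 0.2, 0.3],    # target note duration (seconds) for each ph
    "note_type": [1, 2, 2, 3, 1],             # note type for each ph (rest: 1, lyric: 2, slur: 3)
    "ref_audio": "path/to/reference_48k.wav", # reference audio path (48 kHz)
}
```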
Then run:

```bash
CUDA_VISIBLE_DEVICES=$GPU python inference/StyleSinger.py --config egs/stylesinger.yaml --exp_name checkpoints/StyleSinger
```
Generated wav files are saved in `infer_out` by default.
- Prepare your own singing dataset or download M4Singer. (Note: you have to segment M4Singer and align the note pitch, note duration, and note type (rest: 1, lyric: 2, slur: 3) for each ph, stored as `ep_pitches`, `ep_notedurs`, and `ep_types`.)
- Put `metadata.json` (including ph, word, item_name, ph_durs, wav_fn, singer, ep_pitches, ep_notedurs, and ep_types for each singing voice), `spker_set.json` (including all singers and their IDs), and `phone_set.json` (all phonemes of your dictionary) in `data/processed/style` (see the example entry after the preprocessing commands below).
- Set `processed_data_dir`, `binary_data_dir`, `valid_prefixes`, and `test_prefixes` in the config.
- Download the global emotion encoder to `emotion_encoder_path`.
- Preprocess the dataset:
```bash
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config egs/stylesinger.yaml
```
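For reference, a single `metadata.json` entry might look like the following; the field names are the ones required above, but every value is invented for illustration, so check the binarizer for the exact expected types and alignment:

```json
{
  "item_name": "singer1#song1#0000",
  "ph": ["AP", "s", "en", "en", "AP"],
  "word": ["<AP>", "森", "森", "森", "<AP>"],
  "ph_durs": [0.3, 0.12, 0.28, 0.2, 0.3],
  "wav_fn": "data/processed/style/wavs/singer1#song1#0000.wav",
  "singer": "singer1",
  "ep_pitches": [0, 62, 62, 64, 0],
  "ep_notedurs": [0.3, 0.4, 0.4, 0.2, 0.3],
  "ep_types": [1, 2, 2, 3, 1]
}
```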
Then train StyleSinger:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger --reset
```
After training, run inference with the trained model:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger --infer
```
We provide a mini-set of test samples here to demonstrate StyleSinger. Specifically, we provide samples of the statistical files, which are used for faster IO; download them to `data/binary/style/`. The WAV files are provided for listening.
Run:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger --infer
```
You will find the outputs in `checkpoints/StyleSinger/generated_320000_/wavs`, where [Ref] indicates ground-truth mel results and [SVS] indicates predicted results.
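If you want to separate the two kinds of outputs programmatically, here is a minimal sketch, assuming only the [Ref]/[SVS] naming convention above:

```python
from pathlib import Path

# Sort generated wavs into references and predictions by filename tag.
wav_dir = Path("checkpoints/StyleSinger/generated_320000_/wavs")
for wav in sorted(wav_dir.glob("*.wav")):
    kind = "prediction" if "[SVS]" in wav.name else "reference"
    print(f"{kind}: {wav.name}")
```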
This implementation uses parts of the code from the following GitHub repositories, as described in our code: GenerSpeech, NATSpeech, ProDiff, and DiffSinger.
If you find this code useful in your research, please cite our work:
```bib
@inproceedings{zhang2024stylesinger,
  title={StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis},
  author={Zhang, Yu and Huang, Rongjie and Li, Ruiqi and He, JinZheng and Xia, Yan and Chen, Feiyang and Duan, Xinyu and Huai, Baoxing and Zhao, Zhou},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={17},
  pages={19597--19605},
  year={2024}
}
```
Any organization or individual is prohibited from using any technology mentioned in this paper to generate anyone's singing voice without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this requirement, you may be in violation of copyright laws.