Yu Zhang, Rongjie Huang, Ruiqi Li, JinZheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao | Zhejiang University, Huawei Cloud
PyTorch Implementation of StyleSinger (AAAI 2024): Style Transfer for Out-of-Domain Singing Voice Synthesis.
We provide our implementation and pre-trained models in this repository.
Visit our demo page for audio samples.
You can use the pre-trained models we provide here. Details of each folder are as follows:
| Model | Description |
| --- | --- |
| StyleSinger | Acoustic model (config) |
| HIFI-GAN | Neural vocoder |
| Encoder | Emotion encoder |
A suitable conda environment named `stylesinger` can be created and activated with:
```bash
conda create -n stylesinger python=3.8
conda activate stylesinger
conda install --yes --file requirements.txt
```
By default, this implementation uses as many GPUs in parallel as returned by `torch.cuda.device_count()`. You can specify which GPUs to use by setting the `CUDA_VISIBLE_DEVICES` environment variable before running the training module.
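For example, to make only the first two GPUs visible to PyTorch (this is standard CUDA behavior, not something specific to this repository):

```bash
# Expose only GPUs 0 and 1; torch.cuda.device_count() will then return 2.
export CUDA_VISIBLE_DEVICES=0,1
```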
Here we provide a singing voice synthesis pipeline using StyleSinger.
- Prepare StyleSinger (acoustic model): download the checkpoint and put it at `checkpoints/StyleSinger`.
- Prepare HIFI-GAN (neural vocoder): download the checkpoint and put it at `checkpoints/hifigan`.
- Prepare the emotion encoder: download the checkpoint and put it at `checkpoints/global.pt`.
- Prepare the dataset: download the statistical files and put them at `data/binary/test_set`.
- Prepare the reference information: provide a reference audio (48 kHz) and, for each target phoneme (ph), the target note, note duration, and note type (rest: 1, lyric: 2, slur: 3), together with the reference audio path. Enter this information in `inference/StyleSinger.py` (a sketch of these inputs follows this list).
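As a rough sketch, the reference inputs described above could be organized as follows; the variable names and all values here are hypothetical placeholders, so check `inference/StyleSinger.py` for the actual interface it expects:

```python
# Hypothetical per-phoneme reference inputs (all names and values illustrative).
inp = {
    "ph": ["AP", "s", "en", "en", "AP"],      # target phonemes
    "note": [0, 62, 62, 64, 0],               # target MIDI note for each ph (0 = rest)
    "note_dur": [0.3, 0.4, 0.4, 0.2, 0.3],    # target note duration (seconds) for each ph
    "note_type": [1, 2, 2, 3, 1],             # note type for each ph (rest: 1, lyric: 2, slur: 3)
    "ref_audio": "path/to/reference_48k.wav", # reference audio path (48 kHz)
}
```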
Then run:

```bash
CUDA_VISIBLE_DEVICES=$GPU python inference/StyleSinger.py --config egs/stylesinger.yaml --exp_name checkpoints/StyleSinger
```
Generated wav files are saved in `infer_out` by default.
- Prepare your own singing dataset or download M4Singer. (Note: you have to segment M4Singer and align the note pitch, note duration, and note type (rest: 1, lyric: 2, slur: 3) for each ph, stored as `ep_pitches`, `ep_notedurs`, and `ep_types`.)
- Put `metadata.json` (including ph, word, item_name, ph_durs, wav_fn, singer, ep_pitches, ep_notedurs, and ep_types for each singing voice), `spker_set.json` (including all singers and their IDs), and `phone_set.json` (all phonemes of your dictionary) in `data/processed/style` (see the example entry after the preprocessing commands below).
- Set `processed_data_dir`, `binary_data_dir`, `valid_prefixes`, and `test_prefixes` in the config.
- Download the global emotion encoder to `emotion_encoder_path`.
- Preprocess the dataset:
```bash
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config egs/stylesinger.yaml
```
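For reference, a single `metadata.json` entry might look like the following; the field names are the ones required above, but every value is invented for illustration, so check the binarizer for the exact expected types and alignment:

```json
{
  "item_name": "singer1#song1#0000",
  "ph": ["AP", "s", "en", "en", "AP"],
  "word": ["<AP>", "森", "森", "森", "<AP>"],
  "ph_durs": [0.3, 0.12, 0.28, 0.2, 0.3],
  "wav_fn": "data/processed/style/wavs/singer1#song1#0000.wav",
  "singer": "singer1",
  "ep_pitches": [0, 62, 62, 64, 0],
  "ep_notedurs": [0.3, 0.4, 0.4, 0.2, 0.3],
  "ep_types": [1, 2, 2, 3, 1]
}
```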
Then train StyleSinger:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger --reset
```
After training, run inference with the trained model:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger --infer
```
We provide a mini-set of test samples here to demonstrate StyleSinger. Specifically, we provide samples of the statistical files, which are used for faster IO; download them to `data/binary/style/`. The WAV files are provided for listening.
Run:

```bash
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config egs/stylesinger.yaml --exp_name StyleSinger --infer
```
You will find the outputs in `checkpoints/StyleSinger/generated_320000_/wavs`, where [Ref] indicates ground-truth mel results and [SVS] indicates predicted results.
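If you want to separate the two kinds of outputs programmatically, here is a minimal sketch, assuming only the [Ref]/[SVS] naming convention above:

```python
from pathlib import Path

# Sort generated wavs into references and predictions by filename tag.
wav_dir = Path("checkpoints/StyleSinger/generated_320000_/wavs")
for wav in sorted(wav_dir.glob("*.wav")):
    kind = "prediction" if "[SVS]" in wav.name else "reference"
    print(f"{kind}: {wav.name}")
```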
This implementation uses parts of the code from the following GitHub repositories, as described in our code: GenerSpeech, NATSpeech, ProDiff, and DiffSinger.
If you find this code useful in your research, please cite our work:
```bib
@inproceedings{zhang2024stylesinger,
  title={StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis},
  author={Zhang, Yu and Huang, Rongjie and Li, Ruiqi and He, JinZheng and Xia, Yan and Chen, Feiyang and Duan, Xinyu and Huai, Baoxing and Zhao, Zhou},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={17},
  pages={19597--19605},
  year={2024}
}
```
Any organization or individual is prohibited from using any technology mentioned in this paper to generate anyone's singing voice without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this requirement, you may be in violation of copyright laws.