StableSVC: Latent Diffusion Model for Singing Voice Conversion

TODO:

Debug cross-attention module;
Make dimensionality reduction adjustable in UNet;
Debug Whisper Embedding module;
Implement Classifier-Free Guidance (CFG);
Implement VAE-PatchGAN;
Implement LDM.

Propose Latent Diffusion Model (LDM)[1] for Singing Voice Conversion (SVC)

Current Implementation:

Simple diffusion for SVC (Denosing Diffusion Probabilistic Model, DDPM[2])

Whisper CNN module for processing Whisper embedding(which did not work in experiment):

Tentative Results (see /demo for audios)

See Report for more details

Link for Google Drive working directory: https://drive.google.com/drive/folders/1hY9YPVmqGFB9UIN0WWdJQCAfGAP9-G-1?usp=sharing

References:

[1] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 10684-10695.

[2] Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models[J]. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851.

[3] Liu S, Cao Y, Su D, et al. Diffsvc: A diffusion probabilistic model for singing voice conversion[C]//2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021: 741-748.

[4] Liu H, Chen Z, Yuan Y, et al. Audioldm: Text-to-audio generation with latent diffusion models[J]. arXiv preprint arXiv:2301.12503, 2023.

[5] Wang Y, Ju Z, Tan X, et al. AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models[J]. arXiv preprint arXiv:2304.00830, 2023.

AxiumCrisis61/StableSVC

StableSVC: Latent Diffusion Model for Singing Voice Conversion