awesome-diffusion4speech-papers

Paper List

Speech Synthesis

  • WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis (21.06), Chen et al. [pdf]

  • Diff-TTS: A Denoising Diffusion Model for Text-to-Speech (INTERSPEECH 2021), Jeong et al. [pdf]

  • Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech (ICML 2021), Popov et al. [pdf]

  • DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (21.05), Liu et al. [pdf]

  • PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior (ICLR 2022), Lee et al. [pdf]

  • DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs (22.01), Liu et al. [pdf]

  • BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis (ICLR 2022), Lam et al. [pdf]

  • ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech (ACM MM 2022), Huang et al. [pdf]

  • Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models (22.06), Levkovitch et al. [pdf]

  • FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis (IJCAI 2022), Huang et al. [pdf]

  • FastDiff 2: Dually Incorporating GANs into Diffusion Models for High-Quality Speech Synthesis (22.09), Huang et al. [pdf]

  • Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance (ICML 2022), Kim et al. [pdf]

  • Guided-TTS 2: A Diffusion Model for High-Quality Adaptive Text-to-Speech with Untranscribed Data (22.05), Kim et al. [pdf]

  • Prosody-TTS: Self-Supervised Prosody Pretraining with Latent Diffusion For Text-to-Speech (22.09), Huang et al. [pdf]

  • GAN You Hear Me? Reclaiming Unconditional Speech Synthesis from Diffusion Models (22.10), Baas et al. [pdf]

  • EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance (22.11), Kang et al. [pdf]

  • Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models (22.11), Kang et al. [pdf]

  • NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS (22.11), Yang et al. [pdf]

  • ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech (22.12), Chen et al. [pdf]

  • Text-to-Speech Synthesis Based on Latent Variable Conversion Using Diffusion Probabilistic Model and Variational Autoencoder (22.12), Yasuda et al. [pdf]

  • InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt (23.01), Yang et al. [pdf]

  • An Investigation into the Adaptability of a Diffusion-Based TTS Model (23.03), Chen et al. [pdf]

  • NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers (23.04), Shen et al. [pdf]

Automatic Speech Recognition

  • TransFusion: Transcribing Speech with Multinomial Diffusion (22.10), Baas et al. [pdf]

Speech Enhancement

  • Conditional Diffusion Probabilistic Model for Speech Enhancement (ICASSP 2022), Lu et al. [pdf]

  • Universal Speech Enhancement with Score-Based Diffusion (22.06), Serrà et al. [pdf]

  • Speech Enhancement and Dereverberation with Diffusion-Based Generative Models (22.08), Richter et al. [pdf]

  • Cold Diffusion for Speech Enhancement (22.11), Yen et al. [pdf]

Voice Conversion

  • DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion (ASRU 2021), Liu et al. [pdf]

Speech Editing

  • AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models (23.04), Wang et al. [pdf]

Audio Generation

  • Diffsound: Discrete Diffusion Model for Text-to-Sound Generation (22.07), Yang et al. [pdf]

  • AudioLDM: Text-to-Audio Generation with Latent Diffusion Models (23.01), Liu et al. [pdf]

  • Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models (23.01), Huang et al. [pdf]

Music Generation

  • Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion (23.01), Schneider et al. [pdf]

  • Noise2Music: Text-conditioned Music Generation with Diffusion Models (23.01), Huang et al. [pdf]

  • ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models (23.02), Zhu et al. [pdf]