| Date | Release [Samples] | Paper | Code | Trained Model |
|------|-------------------|-------|------|---------------|
| 03.05 | Diverse and Vivid Sound Generation from Text Descriptions | arXiv | - | - |
| 25.04 | AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | arXiv | GitHub | Hugging Face |
| 24.04 | TANGO: Text-to-Audio generation using instruction tuned LLM and Latent Diffusion Model | PDF | GitHub | Hugging Face |
| 18.04 | NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | arXiv | GitHub (unofficial) | - |
| 10.04 | Bark: Text-Prompted Generative Audio Model | - | GitHub | Hugging Face, Colab |
| 03.04 | AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | arXiv | - | - |
| 29.03 | Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos | arXiv | GitHub | - |
| 08.03 | VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | arXiv | - | - |
| 27.02 | I Hear Your True Colors: Image Guided Audio Generation | arXiv | GitHub | - |
| 27.02 | Continuous descriptor-based control for deep audio synthesis | arXiv | - | - |
| 09.02 | ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models | arXiv | - | - |
| 08.02 | Noise2Music: Text-conditioned Music Generation with Diffusion Models | arXiv | - | - |
| 04.02 | Multi-Source Diffusion Models for Simultaneous Music Generation and Separation | arXiv | GitHub | - |
| 31.01 | InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt | arXiv | GitHub | - |
| 30.01 | SingSong: Generating musical accompaniments from singing | arXiv | - | - |
| 30.01 | AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | arXiv | GitHub | Hugging Face |
| 30.01 | Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion | arXiv | GitHub | - |
| 29.01 | Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | PDF | - | - |
| 28.01 | Noise2Music | - | - | - |
| 27.01 | RAVE2 [Samples RAVE1] | arXiv | GitHub | - |
| 26.01 | MusicLM: Generating Music From Text | arXiv | GitHub (unofficial) | - |
| 18.01 | Msanii: High Fidelity Music Synthesis on a Shoestring Budget | arXiv | GitHub | Hugging Face, Colab |
| 16.01 | ArchiSound: Audio Generation with Diffusion | arXiv | GitHub | - |
| 05.01 | VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | arXiv | GitHub (unofficial) (demo) | - |
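
Several of the releases above ship public checkpoints. As a quick orientation, here is a minimal text-to-audio sketch using the AudioLDM checkpoint published on Hugging Face (see the AudioLDM row); the model ID, step count, and clip length below are illustrative assumptions, not values prescribed by this list.

```python
import torch
import scipy.io.wavfile
from diffusers import AudioLDMPipeline

# "cvssp/audioldm-s-full-v2" is the publicly released AudioLDM checkpoint on
# Hugging Face; treat the ID and the sampling parameters below as assumptions.
pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

audio = pipe(
    "a hammer hitting a wooden surface",
    num_inference_steps=10,   # more steps trade speed for fidelity
    audio_length_in_s=5.0,    # length of the generated clip
).audios[0]

# AudioLDM decodes to 16 kHz mono; write the waveform out as a WAV file.
scipy.io.wavfile.write("output.wav", rate=16000, data=audio)
```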