# Audio AI Timeline

Here we will keep track of the latest AI models for waveform-based audio generation, starting in 2023!

## 2023

| Date  | Release [Samples] | Paper | Code | Trained Model |
|-------|-------------------|-------|------|---------------|
| 03.08 | MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies | arXiv | GitHub | - |
| 04.08 | AudioLDM 2: A General Framework for Audio, Music, and Speech Generation | arXiv | GitHub | Hugging Face |
| 14.07 | Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts | arXiv | - | - |
| 10.07 | VampNet: Music Generation via Masked Acoustic Token Modeling | arXiv | GitHub | - |
| 22.06 | AudioPaLM: A Large Language Model That Can Speak and Listen | arXiv | - | - |
| 19.06 | Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | PDF | GitHub | - |
| 08.06 | MusicGen: Simple and Controllable Music Generation | arXiv | GitHub | Hugging Face / Colab |
| 06.06 | Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias | arXiv | - | - |
| 01.06 | Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis | arXiv | GitHub | - |
| 29.05 | Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | arXiv | - | - |
| 25.05 | MeLoDy: Efficient Neural Music Generation | arXiv | - | - |
| 18.05 | CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training | arXiv | - | - |
| 18.05 | SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | arXiv | GitHub | - |
| 16.05 | SoundStorm: Efficient Parallel Audio Generation | arXiv | GitHub (unofficial) | - |
| 03.05 | Diverse and Vivid Sound Generation from Text Descriptions | arXiv | - | - |
| 02.05 | Long-Term Rhythmic Video Soundtracker | arXiv | GitHub | - |
| 24.04 | TANGO: Text-to-Audio Generation using Instruction Tuned LLM and Latent Diffusion Model | PDF | GitHub | Hugging Face |
| 18.04 | NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | arXiv | GitHub (unofficial) | - |
| 10.04 | Bark: Text-Prompted Generative Audio Model | - | GitHub | Hugging Face / Colab |
| 03.04 | AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | arXiv | - | - |
| 08.03 | VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | arXiv | - | - |
| 27.02 | I Hear Your True Colors: Image Guided Audio Generation | arXiv | GitHub | - |
| 08.02 | Noise2Music: Text-conditioned Music Generation with Diffusion Models | arXiv | - | - |
| 04.02 | Multi-Source Diffusion Models for Simultaneous Music Generation and Separation | arXiv | GitHub | - |
| 30.01 | SingSong: Generating musical accompaniments from singing | arXiv | - | - |
| 30.01 | AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | arXiv | GitHub | Hugging Face |
| 30.01 | Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion | arXiv | GitHub | - |
| 29.01 | Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | PDF | - | - |
| 28.01 | Noise2Music | - | - | - |
| 27.01 | RAVE2 [Samples RAVE1] | arXiv | GitHub | - |
| 26.01 | MusicLM: Generating Music From Text | arXiv | GitHub (unofficial) | - |
| 18.01 | Msanii: High Fidelity Music Synthesis on a Shoestring Budget | arXiv | GitHub | Hugging Face / Colab |
| 16.01 | ArchiSound: Audio Generation with Diffusion | arXiv | GitHub | - |
| 05.01 | VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | arXiv | GitHub (unofficial) (demo) | - |