Recent audio generation (and audio codec) papers, covering speech, music, and general audio.
Year | Org. | Name | Title | Paper | Demo | Code |
---|---|---|---|---|---|---|
2020 | OpenAI | Jukebox | Jukebox: A Generative Model for Music | [2005.00341] | [demo] | [code] |
2021 | Google | SoundStream | SoundStream: An End-to-End Neural Audio Codec | [2107.03312] | [demo] | [code] [code] |
2021 | IRCAM | RAVE | RAVE: A variational autoencoder for fast and high-quality neural audio synthesis | [2111.05011] | [demo] | [code] |
2022 | Google | Perceiver-AR | General-purpose, long-context autoregressive modeling with Perceiver AR | [2202.07765] | [demo] | [code] [code] |
2022 | Stanford | SaShiMi | It's Raw! Audio Generation with State-Space Models | [2202.09729] | [demo] | [code] |
2022 | Baidu | A3T | A3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing | [2203.09690] | [demo] | [code] |
2022 | SJTU | VQTTS | VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature | [2204.00768] | [demo] | [code] |
2022 | Google | Spectrogram Diffusion | Multi-instrument Music Synthesis with Spectrogram Diffusion | [2206.05408] | [demo] | - |
2022 | Microsoft | DelightfulTTS 2 | DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders | [2207.04646] | [demo] | - |
2022 | Google | MuLan | MuLan: A Joint Embedding of Music Audio and Natural Language | [2208.12415] | - | [code] |
2022 | Google | AudioLM | AudioLM: a Language Modeling Approach to Audio Generation | [2209.03143] | [demo] | [code] |
2022 | Meta AI | AudioGen | AudioGen: Textually Guided Audio Generation | [2209.15352] | [demo] | - |
2022 | Microsoft | Museformer | Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation | [2210.10349] | [demo] | [code] |
2022 | Meta AI | EnCodec | High Fidelity Neural Audio Compression | [2210.13438] | [demo] | [code] |
2022 | Meta AI | Modified AudioGen | Audio Language Modeling using Perceptually-Guided Discrete Representations | [2211.01223] | - | - |
2022 | Baidu | ERNIE-SAT | ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech | [2211.03545] | [demo] | [code] |
2022 | Microsoft | PromptTTS | PromptTTS: Controllable Text-to-Speech with Text Descriptions | [2211.12171] | [demo] | - |
2023 | Microsoft | VALL-E | Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers | [2301.02111] | [demo] | [code] |
2023 | - | Msanii | Msanii: High Fidelity Music Synthesis on a Shoestring Budget | [2301.06468] | - | [code] |
2023 | Google | MusicLM | MusicLM: Generating Music From Text | [2301.11325] | [demo] | [code] |
2023 | ETH | Moûsai | Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion | [2301.11757] | [demo] | [code] |
2023 | CVSSP | AudioLDM | AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | [2301.12503] | [demo] | [code] |
2023 | ByteDance | Make-An-Audio | Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | [2301.12661] | [demo] | - |
2023 | Google | SingSong | SingSong: Generating musical accompaniments from singing | [2301.12662] | [demo] | - |
2023 | ETH | ArchiSound | ArchiSound: Audio Generation with Diffusion | [2301.13267] | [demo] | [code] |
2023 | Tencent | InstructTTS | InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt | [2301.13662] | [demo] | - |
2023 | Sapienza University | MSDM | Multi-Source Diffusion Models for Simultaneous Music Generation and Separation | [2302.02257] | [demo] | [code] |
2023 | Google | SPEAR-TTS | Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision | [2302.03540] | [demo] | [code] |
2023 | Google | Noise2Music | Noise2Music: Text-conditioned Music Generation with Diffusion Models | [2302.03917] | [demo] | - |
2023 | CMU | MQTTS | A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech | [2302.04215] | [demo] | [code] |
2023 | Baidu | ERNIE-Music | ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models | [2302.04456] | - | - |
2023 | Microsoft | FoundationTTS | FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model | [2303.02939] | [demo] | - |
2023 | Microsoft | VALL-E X | Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling | [2303.03926] | [demo] | - |