arXiv: Stable Audio Open paper

HuggingFace: model weights

stable-audio-tools: code to reproduce Stable Audio

stable-audio-metrics: code to evaluate Stable Audio

Stable Audio Open generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. It comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder.
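To make the "manageable sequence length" concrete, the sketch below computes how many waveform samples a maximum-length clip contains and how long the latent sequence would be for a given temporal downsampling factor. The duration, sample rate, and channel count come from the description above. The downsampling factor of 2048 is an assumption used only for illustration and is not stated on this page, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 44_100  # from the description above
MAX_DURATION_S = 47      # from the description above
NUM_CHANNELS = 2         # stereo

# Assumption for illustration only: temporal downsampling of the autoencoder.
ASSUMED_DOWNSAMPLING = 2048


def waveform_samples(duration_s, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples per channel for a clip of the given duration."""
    return int(duration_s * sample_rate_hz)


def latent_length(num_samples, downsampling=ASSUMED_DOWNSAMPLING):
    """Latent time steps needed to cover num_samples (ceiling division)."""
    return -(-num_samples // downsampling)


samples_per_channel = waveform_samples(MAX_DURATION_S)
total_values = samples_per_channel * NUM_CHANNELS
latent_steps = latent_length(samples_per_channel)

print(f"samples per channel: {samples_per_channel}")
print(f"raw values (stereo): {total_values}")
print(f"latent steps (assumed factor {ASSUMED_DOWNSAMPLING}): {latent_steps}")
```

Under this assumed factor, the diffusion model would operate on a sequence roughly three orders of magnitude shorter than the raw waveform, which is the motivation for working in the autoencoder's latent space.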

Generations by the community

Prompt: Pinball bumper.

Prompt: 80s drum beat.

Prompt: 80s bass guitar.

Prompt: Slap mandolin.

Generations from AudioCaps prompts

Prompt: Rain is falling and hitting surfaces and then splashing into puddles.

Audio samples from: Stable Audio Open, Stable Audio 2.0, AudioLDM2-48kHz

Prompt: A train horn goes off loudly.

Audio samples from: Stable Audio Open, Stable Audio 2.0, AudioLDM2-48kHz

Prompt: Gurgling and splashing water.

Audio samples from: Stable Audio Open, Stable Audio 2.0, AudioLDM2-48kHz

Prompt: An engine throttles and clanks and then suddenly accelerates off into the distance.

Audio samples from: Stable Audio Open, Stable Audio 2.0, AudioLDM2-48kHz

Generations from Song Describer Dataset prompts

Prompt: A dance music club banger, with a heavy kick, subtle supporting percussion like tabla and bongos, prominent pop synth lines, and a repetitive hook.

Audio samples from: Stable Audio Open, Stable Audio 2.0, MusicGen-large-stereo

Prompt: A danceable electronic track in the genre of dance

Audio samples from: Stable Audio Open, Stable Audio 2.0, MusicGen-large-stereo

Prompt: Fast beat, hip hop, upbeat that has a positive vibe.

Audio samples from: Stable Audio Open, Stable Audio 2.0, MusicGen-large-stereo

Prompt: An instrumental song which employs a worldbeat element through its eerie percussion

Audio samples from: Stable Audio Open, Stable Audio 2.0, MusicGen-large-stereo

Memorization analysis

Recent work has examined the potential of generative models to memorize training data, especially elements that are repeated in the training set. Adhering to principles of responsible model development, we also run a comprehensive study on memorization.

In light of the possible risk of memorizing repeated audio within the training set, we start by studying whether our dataset contains repeated data. We embed all our training data using the LAION-CLAP audio encoder and select audio clips that are close in this space, based on a manually set threshold. The threshold is set such that the selected clips correspond to exact replicas. With this process, we identify 3,693 repeated audio clips from Freesound and 856 from FMA.
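The duplicate search described above can be sketched as a thresholded similarity search over embeddings. This is a minimal illustration, not the procedure actually used. It assumes the embeddings have already been computed (here they are toy vectors, not LAION-CLAP outputs) and that closeness is measured with cosine similarity. The function name `find_near_duplicates` and the threshold value are hypothetical.

```python
import numpy as np


def find_near_duplicates(embeddings, threshold):
    """Return index pairs (i, j), i < j, whose cosine similarity >= threshold."""
    emb = np.asarray(embeddings, dtype=np.float64)
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    # Only inspect the upper triangle to skip self-pairs and mirrored pairs.
    i_idx, j_idx = np.triu_indices(len(unit), k=1)
    mask = sim[i_idx, j_idx] >= threshold
    return [(int(i), int(j)) for i, j in zip(i_idx[mask], j_idx[mask])]


# Toy embeddings standing in for audio-encoder outputs (assumption).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],    # exact replica of item 0
    [0.0, 1.0, 0.0],
    [0.99, 0.05, 0.0],  # similar to item 0, but not a replica
])

# A strict, manually chosen threshold keeps only (near-)exact replicas.
duplicate_pairs = find_near_duplicates(toy_embeddings, threshold=0.999)
print(duplicate_pairs)
```

The full pairwise matrix is only practical for small collections. A dataset of this scale would more likely need a batched or approximate nearest-neighbor search, which this sketch does not cover.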

Our methodology is based on comparing our model's generations against the training set in LAION-CLAP space. We select the top-50 generations that are closest to the training data (the memorization candidates) and listen to them. We listened to memorization candidates generated with prompts from the identified repeated data in our training set, and did not find memorization. We also listened to memorization candidates from 11,000 random prompts from the training set, and did not find memorization. We even listened to memorization candidates from outstanding generations, and again did not find memorization. The most interesting memorization candidates from training data prompts, together with their closest training data, are listed here:

Generation by our model | Closest training data | Prompt
[audio] | link | Scale, clarinet, Asharpmajor, neumann-U87, good-sounds.
[audio] | link | Disturb, no-signal, tv, noise, radio, high-disturbance, frequency-jam, white-noise.
[audio] | link | 120, bpm, beat, Drums, blues, loop.
[audio] | link | Thunder, storm, field-recording, rain.
[audio] | link | Avant-garde, improv, contemporary classical, instrumental.
[audio] | link | Piano, modern jazz, minimalism, instrumental.
[audio] | link | 160-BPM, kick-hat-snare, drumloop.
[audio] | link | 1000Hz 48k sample rate MP3.
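The candidate selection described in this section can be sketched as follows: for each generation, find its nearest training item in embedding space, then keep the generations with the highest nearest-neighbor similarity for manual listening. This is only an illustration under stated assumptions. The embeddings are toy vectors rather than LAION-CLAP outputs, cosine similarity is assumed as the closeness measure, and `memorization_candidates` is a hypothetical helper name.

```python
import numpy as np


def _unit_rows(x):
    """Normalize each row to unit length (guarding against zero vectors)."""
    x = np.asarray(x, dtype=np.float64)
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)


def memorization_candidates(gen_emb, train_emb, top_k=50):
    """Return (generation_idx, closest_train_idx, similarity), most similar first."""
    sim = _unit_rows(gen_emb) @ _unit_rows(train_emb).T  # (num_gen, num_train)
    nearest_idx = sim.argmax(axis=1)  # closest training item per generation
    nearest_sim = sim.max(axis=1)     # similarity to that training item
    order = np.argsort(-nearest_sim, kind="stable")[:top_k]
    return [(int(g), int(nearest_idx[g]), float(nearest_sim[g])) for g in order]


# Toy stand-ins for embeddings of training audio and generated audio (assumption).
train_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
generated_embeddings = np.array([[1.0, 0.1], [1.0, 1.0], [0.0, 1.0]])

candidates = memorization_candidates(generated_embeddings, train_embeddings, top_k=2)
for gen_idx, train_idx, score in candidates:
    print(f"generation {gen_idx} -> training item {train_idx} (similarity {score:.3f})")
```

The ranking only proposes candidates. As in the study above, whether a candidate actually reproduces training data is decided by listening to each generation alongside its closest training item.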

Autoencoder reconstructions

This comparison is useful for evaluating the audio fidelity of the autoencoder. For each example, we provide the ground truth recording, followed by the same recording after passing it through each of the autoencoders or neural audio codecs listed.

Audio samples (7 examples each) from: Ground truth, Stable Audio Open, Stable Audio 2.0, DAC