Ziyang Chen, Daniel Geng, Andrew Owens
University of Michigan, Ann Arbor
arXiv 2024
[Paper] [Project Page]
This repository contains the code to generate images that sound: special spectrograms that can be viewed as images and played as sound.
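To give a rough feel for the idea (this is not the repo's actual pipeline, which uses Auffusion's pretrained vocoder on mel spectrograms), a grayscale spectrogram image can be inverted to a waveform with Griffin-Lim; the file name, image size, and sample rate below are placeholders:

import numpy as np
import torch
import torchaudio
from PIL import Image

# Load a generated spectrogram image as a [freq, time] magnitude array in [0, 1].
img = Image.open("sample_spectrogram.png").convert("L").resize((1024, 512))  # (width, height)
mag = torch.from_numpy(np.array(img)).float() / 255.0  # shape: [512, 1024]

# Invert the magnitude spectrogram with Griffin-Lim.
# n_fft is chosen so that n_fft // 2 + 1 matches the 512 frequency bins.
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=1022, hop_length=256, power=1.0)
waveform = griffin_lim(mag)

torchaudio.save("sample.wav", waveform.unsqueeze(0), sample_rate=16000)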
To set up the environment, simply run:
conda env create -f environment.yml
conda activate soundify
Pro tip: we highly recommend using mamba instead of conda for much faster environment solving and installation.
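If you have mamba installed, the same setup works with conda swapped out (activation can still go through conda):

mamba env create -f environment.yml
conda activate soundify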
DeepFloyd: our repo also uses DeepFloyd IF. To use DeepFloyd IF, you must accept its usage conditions. To do so:
- Sign up or log in to your Hugging Face account.
- Accept the license on the model card of DeepFloyd/IF-I-XL-v1.0.
- Log in locally by running
python huggingface_login.py
and entering your Hugging Face Hub access token when prompted. It does not matter how you answer the Add token as git credential? (Y/n) question.
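Alternatively, you can log in with the standard Hugging Face CLI, which is functionally equivalent to the script above:

huggingface-cli login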
We use the pretrained image latent diffusion model Stable Diffusion v1.5 and the pretrained audio latent diffusion model Auffusion, which is fine-tuned from Stable Diffusion. We provide the code (including visualization) and instructions for our approach (multimodal denoising) and two proposed baselines: Imprint and SDS. Note that our code is built on Hydra, so you can override configuration parameters from the command line using Hydra's syntax.
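Hydra overrides are passed as key=value pairs appended to the command; the exact parameter names depend on the config files under configs/, so the seed override below is illustrative only:

python src/main_denoise.py experiment=examples/bell seed=123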
To create images that sound using our multimodal denoising method, run the code with config files under configs/main_denoise/experiment:
python src/main_denoise.py experiment=examples/bell
Note: our method does not have a high success rate, since it is zero-shot and depends heavily on the initial random noise. We recommend generating many samples (e.g., N=100) and hand-picking the high-quality results.
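One way to do this is to sweep over seeds in a shell loop; as above, the seed parameter name is illustrative and should be matched to the actual config:

for i in $(seq 1 100); do
  python src/main_denoise.py experiment=examples/bell seed=$i
done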
To create images that sound using our proposed imprint baseline method, run the code with config files under configs/main_imprint/experiment:
python src/main_imprint.py experiment=examples/bell
To create images that sound using our proposed multimodal SDS baseline method, run the code with config files under configs/main_sds/experiment:
python src/main_sds.py experiment=examples/bell
Note: we find that Audio SDS does not work for many audio prompts. We hypothesize that this is because latent diffusion models do not work as well as pixel-based diffusion models for SDS.
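As a rough sketch of why this could matter (notation ours, not from the paper): with a pixel-space model, the SDS gradient has the form

$$\nabla_\theta \mathcal{L}_\text{SDS} = \mathbb{E}_{t,\epsilon}\big[\, w(t)\,(\hat\epsilon_\phi(x_t; y, t) - \epsilon)\, \tfrac{\partial x}{\partial \theta} \,\big],$$

whereas a latent diffusion model scores latents $z = \mathcal{E}(x)$, so the update must additionally backpropagate through the VAE encoder Jacobian $\partial z / \partial x$, which can make the guidance signal noisier.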
We also provide the colorization code under src/colorization, which is adapted from Factorized Diffusion. To directly generate colorized videos with audio, run:
python src/colorization/create_color_video.py \
--sample_dir /path/to/generated/sample/dir \
--prompt "a colorful photo of [object]" \
--num_samples 16 --guidance_scale 10 \
--num_inference_steps 30 --start_diffusion_step 7
Note: since our generated images fall outside the typical image distribution, we recommend running more trials (e.g., num_samples=16) to select the best colorized results.
Our code is based on Lightning-Hydra-Template, diffusers, stable-dreamfusion, Diffusion-Illusions, Auffusion, and visual-anagrams. We appreciate their open-source code.