
Official code for SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound

Primary language: Python · License: Apache-2.0

SEE-2-SOUND🔊: Zero-Shot Spatial Environment-to-Spatial Sound

Rishit Dagli¹ · Shivesh Prakash¹ · Rupert Wu¹ · Houman Khosravani¹,²,³

¹University of Toronto    ²Temerty Centre for Artificial Intelligence Research and Education in Medicine    ³Sunnybrook Research Institute


This work presents SEE-2-SOUND, a method to generate spatial audio from images, animated images, and videos to accompany the visual content. Check out our website to view some results of this work.


Installation

You can skip this section and run everything in a Docker container (see Run in Docker) or through Gradio (see Build using Gradio; for any HF/Gradio issues, cc @jadechoghari 🤗).

First, install the pip package by running:

pip install -e git+https://github.com/see2sound/see2sound.git#egg=see2sound

Now, install all the required packages:

git clone https://github.com/see2sound/see2sound
cd see2sound
pip install -r requirements.txt

Evaluation (but not inference) requires the Visual Acoustic Matching (AViTAR) codebase. Because AViTAR needs many changes to run, install it from the fork we host:

pip install git+https://github.com/Rishit-dagli/visual-acoustic-matching-s2s
OR
git clone https://github.com/Rishit-dagli/visual-acoustic-matching-s2s
cd visual-acoustic-matching-s2s
pip install -e .

Check out the Tips section for tips on installing the requirements.

Overview of codebase

SEE-2-SOUND consists of three main components: source estimation, audio generation, and surround sound spatial audio generation.

[Figure: overview of the method]

In the source estimation phase, the model identifies regions of interest in the input media and estimates their 3D positions on a viewing sphere. It also estimates the monocular depth map of the input image.
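
As a rough illustration of the source estimation step, the sketch below maps the pixel at the centre of a region of interest, together with its estimated depth, to a 3D point around the viewer. The function name `pixel_to_sphere`, the field-of-view defaults, and the axis convention are all assumptions made for this example; the actual mapping used by the package may differ.

```python
import math


def pixel_to_sphere(u, v, width, height, depth, h_fov_deg=90.0, v_fov_deg=60.0):
    """Map pixel (u, v) with a depth value to a 3D point around the viewer.

    Hypothetical helper: the field-of-view defaults and the axis convention
    (x right, y forward, z up) are assumptions for illustration only.
    """
    # Normalise pixel centres to [-0.5, 0.5], with (0, 0) at the image centre.
    nx = (u + 0.5) / width - 0.5
    ny = 0.5 - (v + 0.5) / height  # image rows grow downward, elevation grows upward

    # Treat the image as spanning the assumed horizontal and vertical fields of view.
    azimuth = math.radians(nx * h_fov_deg)
    elevation = math.radians(ny * v_fov_deg)

    # Spherical to Cartesian, using the depth estimate as the radius.
    x = depth * math.cos(elevation) * math.sin(azimuth)
    y = depth * math.cos(elevation) * math.cos(azimuth)
    z = depth * math.sin(elevation)
    return (x, y, z)
```

With this convention, a region centred in the image lands straight ahead of the viewer at a distance equal to its depth, and regions toward the right edge get a positive x coordinate.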

Next, in the audio generation phase, the model generates mono audio clips for each identified region of interest, leveraging a pre-trained CoDi model. These audio clips are then combined with the spatial information to create a 4D representation for each region.

Finally, the model generates 5.1 surround sound spatial audio by placing sound sources in a virtual room and computing Room Impulse Responses (RIRs) for each source-microphone pair. Microphones are positioned according to the 5.1 channel configuration, ensuring compatibility with prevalent audio systems and enhancing the immersive quality of the audio output.
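
To make the microphone placement more concrete, here is a minimal sketch that places six virtual microphones on a circle using the nominal 5.1 loudspeaker angles from ITU-R BS.775 (±30° for left/right, 0° for centre, ±110° for the surrounds). It then computes only the direct-path delay and distance attenuation for one source-microphone pair. This is a deliberate simplification: a full Room Impulse Response also includes wall reflections, which this sketch ignores. The function names, the LFE placement, and the sign convention are assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, dry air at roughly 20 °C

# Nominal azimuths in degrees (0 = straight ahead, positive = to the right).
# The LFE channel is not directional; placing it at the front is an assumption.
CHANNEL_AZIMUTHS = {
    "L": -30.0, "R": 30.0, "C": 0.0, "LFE": 0.0, "Ls": -110.0, "Rs": 110.0,
}


def microphone_positions(center, radius):
    """Return {channel: (x, y, z)} for microphones on a circle around `center`."""
    cx, cy, cz = center
    positions = {}
    for name, azimuth_deg in CHANNEL_AZIMUTHS.items():
        a = math.radians(azimuth_deg)
        # x points right and y points forward, so azimuth 0 lies on the +y axis.
        positions[name] = (cx + radius * math.sin(a), cy + radius * math.cos(a), cz)
    return positions


def direct_path(source, mic):
    """Return (delay in seconds, gain) for the direct path only, no reflections."""
    distance = math.dist(source, mic)
    delay_s = distance / SPEED_OF_SOUND
    gain = 1.0 / max(distance, 1e-6)  # inverse-distance law, guarded against zero
    return delay_s, gain
```

Delaying and scaling a mono clip by these values for each channel gives a crude free-field approximation of what a simulated room would add per microphone.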

For evaluation, we propose a new quantitative evaluation technique and also conduct a user study. A new technique is needed because such a system is difficult to evaluate, especially in the absence of any baselines. For the quantitative evaluation, we produce outputs from an image-to-audio system (CoDi), which serves as a baseline, and from our system. We then run both sets of outputs through AViTAR, which edits the audio to match the visual content, and compute similarity scores between pairs of these audio clips for each image.
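
The text above does not specify which similarity measure is used, so the following is only a sketch of how per-image scores could be aggregated, assuming each audio clip has already been reduced to a fixed-length feature vector and that cosine similarity is an acceptable stand-in. Both function names are hypothetical, and which two clips form a pair is left to the caller.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for all-zero (silent) features
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(pairs):
    """Average the similarity over a list of (features_a, features_b), one pair per image."""
    if not pairs:
        raise ValueError("need at least one pair")
    scores = [cosine_similarity(a, b) for a, b in pairs]
    return sum(scores) / len(scores)
```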

inference

The See2Sound class has a few main methods:

  • setup that downloads and loads models into memory (in high memory mode)
  • adjust_audio that simulates a room and computes the spatial audio
  • run that puts together inference code

evaluation

The eval_See2Sound class has a few main methods:

  • setup to download and load models into memory (in high memory mode)
  • generate_audio to generate mono audio
  • run_avitar to run the AViTAR model
  • compute_acoustic_similarity to compute quantitative metrics

Usage

Here is a guide to using this codebase. It should be relatively quick to get started, since the code is packaged as a pip package.

All of the code is designed to be run from the root of this repository.

inference

import see2sound


config_file_path = "default_config.yaml"

model = see2sound.See2Sound(config_path = config_file_path)
model.setup()
model.run(path = "test.png", output_path = "test.wav")
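
If you have a whole folder of inputs, a small wrapper around `model.run` can process them one by one. The sketch below is an assumption-laden example, not part of the package: `run_batch` and `SUPPORTED_SUFFIXES` are hypothetical names, and the list of file extensions is a guess based on the supported media types (images, animated images, and videos).

```python
from pathlib import Path

# Assumed set of input extensions; adjust to whatever your setup actually accepts.
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".mp4"}


def run_batch(model, input_dir, output_dir):
    """Call model.run on every supported file in input_dir; return the output paths.

    `model` is any object exposing run(path=..., output_path=...), such as a
    See2Sound instance after setup() has been called.
    """
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for media in sorted(input_dir.iterdir()):
        if media.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue  # skip files that are not recognised media
        # Note: inputs sharing a stem (a.png, a.gif) would map to the same .wav file.
        target = output_dir / (media.stem + ".wav")
        model.run(path=str(media), output_path=str(target))
        written.append(target)
    return written


# Example usage (requires the see2sound package and a config file):
#   import see2sound
#   model = see2sound.See2Sound(config_path="default_config.yaml")
#   model.setup()
#   run_batch(model, "inputs", "outputs")
```

Calling `setup()` once and reusing the same model for every file avoids reloading the models for each input.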

evaluation

Run evaluation only if you want to perform quantitative evaluations with the method we propose in our work.

from see2sound.evaluation import eval_See2Sound


config_file_path = "default_config.yaml"
image_dir_path = "path to a directory with all images"

evaluator = eval_See2Sound(config_path = config_file_path)
evaluator.setup()
evaluator.evaluate(image_dir_path)

Run in Docker

You can run inference and evaluation in a container. This guide uses Docker, but you should be able to use any other container runtime too.

Start by building the container image by running:

docker build . -t rishitdagli/see2sound:latest

Alternatively, you can directly use the prebuilt image (41 GB compressed):

docker pull rishitdagli/see2sound:latest

You can now use docker run to start inference or evaluation in the container, with the environment set up and the models pre-downloaded for you.

Build using Gradio

You can also set up the app using Gradio:

pip install -r gradio-req.txt
python app.py

Tips

We share some tips on running the code and reproducing our results.

on installing required packages

  • If you just want to run inference, we recommend using torch==2.3.0 and the slim requirements.txt file.
  • It is possible to perform the quantitative evaluation with the original Visual Acoustic Matching repository. However, we suggest using the fork, which has some additional features that are required if you want to run our code from the pip package.
  • The repository depends on tensorflow, which is required by speech_metrics and vam. However, this is only needed for the quantitative evaluations in our work and not for inference.
  • Our codebase works with PyTorch 2.x, and all of our code and results were produced with PyTorch 2.3.0. Our requirements.txt file, however, lists PyTorch 1.13.1, since PyTorch 1.x is required for our quantitative evaluation.

on downloading models

  • If you are running inference, the code downloads a few artifacts: Segment Anything weights, Depth Anything weights, CoDi weights, and CLIP ViT-H Tokenizer.
  • If you are running evaluation, the code downloads a few artifacts: Segment Anything weights, Depth Anything weights, CoDi weights, CLIP ViT-H Tokenizer, and AViTAR weights.
  • The paths to all of these files can be set in the config yaml. The package looks for the files at these paths and, if they are not there, downloads them to these paths. Once the files are in place, inference or evaluation can run without any network connection.
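
Before running without a network connection, it can help to check that every artifact is already on disk. The sketch below assumes you have collected the artifact paths from your config into a plain dictionary; the function name `missing_artifacts` and the dictionary keys are hypothetical and do not correspond to actual config keys.

```python
from pathlib import Path


def missing_artifacts(artifact_paths):
    """Return the sorted names of artifacts whose path is unset or does not exist.

    `artifact_paths` maps a label of your choosing to a filesystem path.
    """
    return sorted(
        name
        for name, path in artifact_paths.items()
        if not path or not Path(path).exists()
    )


# Example usage with made-up labels and paths:
#   paths = {"segment_anything": "weights/sam.pth", "codi": "weights/codi.pth"}
#   missing = missing_artifacts(paths)
#   if missing:
#       print("Download these before going offline:", ", ".join(missing))
```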

on compute

  • We have optimized the code for, and run all of the experiments on, an A100 - 80 GB GPU. We have also tested the code on an A100 - 40 GB GPU, an H100 - 80 GB GPU, and a V100 - 32 GB GPU (run with the low memory mode), where inference and evaluation also seem to run fast.
  • In general, we recommend a GPU with more than 40 GB of vRAM. You can, however, run this on a GPU with 24 GB or more of vRAM in the low memory mode (which trades lower peak vRAM usage for a longer run time).
  • We recommend at least 24 GB of CPU RAM for the code to work well, and ideally 32 GB.
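
The vRAM guidance above can be summarised as a small decision rule. This is only a sketch: the function name `recommend_memory_mode` and the returned labels are made up for this example, and treating exactly 40 GB as sufficient for the high memory mode is an interpretation of the recommendation, not a documented threshold.

```python
def recommend_memory_mode(vram_gb):
    """Suggest a memory mode from the amount of GPU memory in GB.

    Thresholds follow the tips above: around 40 GB or more for the high
    memory mode, 24 GB or more for the low memory mode.
    """
    if vram_gb >= 40:
        return "high"
    if vram_gb >= 24:
        return "low"
    return "unsupported"
```

If PyTorch is installed, the total memory of the first GPU in bytes is available as `torch.cuda.get_device_properties(0).total_memory`, which can be divided by `1024 ** 3` to get a value in GB for this function.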

on running inference

  • All of our experiments were run with Segment Anything ViT-H and Depth Anything ViT-L. However, either model can be swapped for a smaller variant, or for a different model altogether, through the config yaml file.
  • We suggest running inference with at least 500 diffusion steps and setting num_audios to between 3 and 5.

Credits

This codebase is built on top of the following repositories, and we thank their authors for maintaining them:

Citation

If you find SEE-2-SOUND helpful, please consider citing:

@misc{dagli2024see2sound,
      title={SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound}, 
      author={Rishit Dagli and Shivesh Prakash and Robert Wu and Houman Khosravani},
      year={2024},
      eprint={2406.06612},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}