Audio-Synchronized Visual Animation

Lin Zhang1, Shentong Mo2, Yijing Zhang1, Pedro Morgado1

University of Wisconsin-Madison1
Carnegie Mellon University2

ECCV 2024
Oral Presentation

Checklist

  • Release pretrained checkpoints
  • Release inference code on audio-conditioned image animation and sync metrics
  • Release ASVA training and evaluation code
  • Release AVSync classifier training and evaluation code
  • Release Huggingface Demo

1. Create environment

We use the video_reader backend of torchvision to load audio and videos, which requires building torchvision from source:

conda create -n asva python==3.10 -y
conda activate asva

pip install torch==2.1.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

# Build torchvision from source
mkdir -p submodules
cd submodules
git clone https://github.com/pytorch/vision.git
cd vision
git checkout tags/v0.16.0
conda install -c conda-forge 'ffmpeg<4.3' -y
python setup.py install
cd ../..

pip install -r requirements.txt

export PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/submodules/ImageBind
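
To verify the build, a quick check along these lines should run without errors (a minimal sketch; the video path is a placeholder for any local .mp4 file):

# Sanity check: set_video_backend raises a RuntimeError if torchvision
# was not compiled with the video_reader backend.
import torchvision

torchvision.set_video_backend("video_reader")
reader = torchvision.io.VideoReader("path/to/any_video.mp4", "video")  # placeholder path
frame = next(iter(reader))
print(frame["data"].shape, frame["pts"])  # C x H x W uint8 frame and its timestamp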

2. Download pretrained models

Download required features/models

Please download them and structure them as follows:

- submodules/
    - ImageBind/
- pretrained/
    - i3d_torchscript.pt
    - stable-diffusion-v1-5/
    - openai-clip-l_null_text_encoding.pt
    - AVID-CMA_Audioset_InstX-N1024-PosW-N64-Top32_checkpoint.pth.tar

Download pretrained AVSyncD and AVSync Classifier checkpoints

Model    Dataset          Checkpoint   Config   Audio CFG   FVD      AlignSync
AVSyncD  AVSync15         GoogleDrive  Link     1.0         323.06   22.21
                                                4.0         300.82   22.64
                                                8.0         375.02   22.70
         Landscapes       GoogleDrive  Link     1.0         491.37   24.94
                                                4.0         449.59   25.02
                                                8.0         547.97   25.16
         TheGreatestHits  GoogleDrive  Link     1.0         305.41   22.56
                                                4.0         255.49   22.89
                                                8.0         279.12   23.14

Model              Dataset   Checkpoint   Config   A2V Sync Acc   V2A Sync Acc
AVSync Classifier  VGGSS     GoogleDrive  Link     40.76          40.86

Please download the checkpoints you need and structure them as follows:

- checkpoints/
    - audio-cond_animation/
        - avsync15_audio-cond_cfg/
        - landscapes_audio-cond_cfg/
        - thegreatesthits_audio-cond_cfg/
    - avsync/
        - vggss_sync_contrast/

3. Demo

Generate animation on audio / image / video

The program first tries to load the audio from --audio and the image from --image. If either is not specified, it falls back to extracting the missing audio or image from --video.

python -W ignore scripts/animation_demo.py --dataset AVSync15 --category "lions roaring" --audio_guidance 4.0 \
    --audio ./assets/lions_roaring.wav --image ./assets/lion_and_gun.png --save_path ./assets/generation_lion_roaring.mp4

python -W ignore scripts/animation_demo.py --dataset AVSync15 --category "machine gun shooting" --audio_guidance 4.0 \
    --audio ./assets/machine_gun_shooting.wav --image ./assets/lion_and_gun.png --save_path ./assets/generation_lion_shooting_gun.mp4
[Example results: lion roaring | lion shooting gun]
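
Conceptually, the fallback behaves like the sketch below (illustrative only; the actual argument handling lives in scripts/animation_demo.py, and the function name here is hypothetical):

# Rough sketch of the input fallback: prefer the explicit --audio / --image
# arguments, and extract whichever is missing from --video.
import torchaudio
from torchvision.io import read_image, read_video

def load_inputs(audio=None, image=None, video=None):
    waveform = frame = sample_rate = None
    if audio is not None:
        waveform, sample_rate = torchaudio.load(audio)
    if image is not None:
        frame = read_image(image)                        # C x H x W uint8
    if video is not None and (waveform is None or frame is None):
        vframes, aframes, info = read_video(video, pts_unit="sec")
        if waveform is None:
            waveform, sample_rate = aframes, info["audio_fps"]
        if frame is None:
            frame = vframes[0].permute(2, 0, 1)          # first frame as the image
    return waveform, sample_rate, frame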

Compute sync metrics for audio-video pairs

We provide three metrics:

AVSync score

The raw output of the AVSync classifier for an input (audio, video) pair; its range is (-∞, ∞).

python -W ignore scripts/avsync_metric.py --metric avsync_score --audio {audio path} --video {video path}

RelSync

Measures synchronization of an (audio, video) pair by using a reference.

To measure synchronization for audio generation, the reference is a groundtruth audio:

python -W ignore scripts/avsync_metric.py --metric relsync --audio {generated audio path} --video {video path} --ref_audio {groundtruth audio path}

To measure synchronization for video generation, the reference is a groundtruth video:

python -W ignore scripts/avsync_metric.py --metric relsync --audio {audio path} --video {generated video path} --ref_video {groundtruth video path}

AlignSync

Measures synchronization of an (audio, video) pair by using a reference video. It is only used to measure sync for video generation.

python -W ignore scripts/avsync_metric.py --metric alignsync --audio {audio path} --video {generated video path} --ref_video {groundtruth video path}
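
To score a whole folder of generated clips, the script can be driven in a loop; a minimal sketch, assuming generated videos in generated/ with same-named groundtruth videos and .wav audios in groundtruth/ (the folder layout and naming are assumptions):

# Batch AlignSync over a folder, shelling out to the metric script with the
# flags documented above (paths are hypothetical).
import subprocess
from pathlib import Path

gen_dir, gt_dir = Path("generated"), Path("groundtruth")
for gen in sorted(gen_dir.glob("*.mp4")):
    subprocess.run([
        "python", "-W", "ignore", "scripts/avsync_metric.py",
        "--metric", "alignsync",
        "--audio", str((gt_dir / gen.name).with_suffix(".wav")),
        "--video", str(gen),
        "--ref_video", str(gt_dir / gen.name),
    ], check=True)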

4. Download datasets

Each dataset has 3 files/folders:

  • videos/: the directory to store all .mp4 video files
  • train.txt: training file names
  • test.txt: testing file names

Optionally, we provide two precomputed files for convenience:

  • class_mapping.json: mapping category string in file name to text string used for conditioning
  • class_clip_text_encodings_stable-diffusion-v1-5.pt: mapping text string used for conditioning to clip text encodings

Download these files from GoogleDrive, and place them under datasets/ folder.
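
Once in place, the two precomputed files can be inspected as follows (a minimal sketch based on the file descriptions above; the example key is hypothetical):

# Inspect the class mapping and the precomputed CLIP text encodings.
import json
import torch

with open("datasets/AVSync15/class_mapping.json") as f:
    class_mapping = json.load(f)   # file-name category -> conditioning text

encodings = torch.load("datasets/AVSync15/class_clip_text_encodings_stable-diffusion-v1-5.pt")
text = class_mapping["baby_babbling_crying"]   # hypothetical key
print(text, encodings[text].shape)             # conditioning text and its CLIP encoding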

To download videos:

  • AVSync15: download videos from link above. (Last update: July 26 2024)
  • Landscapes: download videos from MMDiffusion.
  • TheGreatestHits: download videos from Visually Indicated Sounds.
  • VGGSS: for AVSync classifier training/evaluation, download videos from VGGSound. Only videos listed in train.txt and test.txt are needed.

Overall, the datasets/ folder has the following structure:

- datasets/
    - AVSync15/
        - videos/
            - baby_babbling_crying/
            - cap_gun_shooting/
            - ...
        - train.txt
        - test.txt
        - class_mapping.json
        - class_clip_text_encodings_stable-diffusion-v1-5.pt
    - Landscapes/
        - videos/
            - train/
                - explosion
                - ...
            - test/
                - explosion
                - ...
            - ...
        - train.txt
        - test.txt
        - class_mapping.json
        - class_clip_text_encodings_stable-diffusion-v1-5.pt
    - TheGreatestHits/
        - videos/
            - xxxx_denoised_thumb.mp4
            - ...
        - train.txt
        - test.txt
        - class_clip_text_encodings_stable-diffusion-v1-5.pt
    - VGGSS/
        - videos/
            - air_conditioning_noise/
            - air_horn/
            - ...
        - train.txt
        - test.txt
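
After downloading, a quick integrity check can confirm that every listed clip is present (a sketch that assumes train.txt/test.txt contain one video path per line, relative to videos/):

# Verify that all videos named in the split files exist under videos/.
from pathlib import Path

root = Path("datasets/AVSync15")
for split in ("train.txt", "test.txt"):
    names = [line.strip() for line in (root / split).read_text().splitlines() if line.strip()]
    missing = [name for name in names if not (root / "videos" / name).exists()]
    print(f"{split}: {len(names) - len(missing)}/{len(names)} videos present")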

5. Train and evaluate AVSyncD

Train

Training is done on 8 RTX-A4500 GPUs (20 GB) for AVSync15/Landscapes, or on 4 A100 GPUs for TheGreatestHits, with a total batch size of 64, using accelerate for distributed training and wandb for logging. Checkpoints are saved every checkpointing_steps iterations, with older ones flushed; in addition, the checkpoint at the checkpointing_milestones-th iteration and the one at the last iteration are both kept. Please adjust these two parameters in the .yaml config file to avoid important weights being flushed when you customize the training recipe.

PYTHONWARNINGS="ignore" accelerate launch scripts/animation_train.py --config_file configs/audio-cond_animation/{datasetname}_audio-cond_cfg.yaml

Results are saved to exps/audio-cond_animation/{dataset}_audio-cond_cfg, with the same structure as the pretrained checkpoints.

Evaluation

Evaluation is two-step:

  1. Generate 3 clips per video for the test set using scripts/animation_gen.py
  2. Evaluate the generated clips against the groundtruth clips using scripts/animation_eval.py

Please refer to scripts/animation_test_{dataset}.sh for the steps. For example, to evaluate AVSyncD pretrained on AVSync15 with audio guidance scale = 4.0:

bash scripts/animation_test_avsync15.sh checkpoints/audio-cond_animation/avsync15_audio-cond_cfg 37000 4.0

6. Train and evaluate AVSync Classifier

Train

The AVSync Classifier is trained on the VGGSS training split for 4 days on 8 RTX-A4500 GPUs with a batch size of 32.

PYTHONWARNINGS="ignore" accelerate launch scripts/avsync_train.py --config_file configs/avsync/vggss_sync_contrast.yaml

Evaluation

We follow VGGSoundSync and sample 31 clips from each video, with a 0.04 s gap between neighboring clips. Given the audio/video clip at the center, we predict the index of its synchronized video/audio clip. A tolerance of ±5 clips is applied, since humans are tolerant to asynchrony of up to 0.2 s.
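
In code, the tolerance rule amounts to the following small illustration of the protocol (names and the sample predictions are hypothetical):

# 31 clips, 0.04 s apart; a prediction counts as synchronized if it falls
# within +-5 clips (i.e., 0.2 s) of the center clip.
NUM_CLIPS = 31
CENTER = NUM_CLIPS // 2   # index 15
TOLERANCE = 5             # 5 * 0.04 s = 0.2 s

def is_synced(predicted_index: int) -> bool:
    return abs(predicted_index - CENTER) <= TOLERANCE

predictions = [14, 20, 3, 15]   # hypothetical predicted indices
accuracy = sum(map(is_synced, predictions)) / len(predictions)
print(accuracy)  # 0.75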

For example, to evaluate our pretrained AVSync Classifier on 8 GPUs, run:

PYTHONWARNINGS="ignore" accelerate launch --num_processes=8 scripts/avsync_eval.py --checkpoint checkpoints/avsync/vggss_sync_contrast/ckpts/checkpoint-40000/modules --mixed_precision fp16 

Citation

Please consider citing our paper if you find this repo useful:

@inproceedings{linz2024asva,
    title={Audio-Synchronized Visual Animation},
    author={Lin Zhang and Shentong Mo and Yijing Zhang and Pedro Morgado},
    booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
    year={2024}
}