Lin Zhang¹, Shentong Mo², Yijing Zhang¹, Pedro Morgado¹
¹University of Wisconsin-Madison, ²Carnegie Mellon University
ECCV 2024
Oral Presentation
- Release pretrained checkpoints
- Release inference code on audio-conditioned image animation and sync metrics
- Release ASVA training and evaluation code
- Release AVSync classifier training and evaluation code
- Release Huggingface Demo
We use the `video_reader` backend of torchvision to load audio and video, which requires building torchvision from source.
conda create -n asva python==3.10 -y
conda activate asva
pip install torch==2.1.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
# Build torchvision from source
mkdir -p submodules
cd submodules
git clone https://github.com/pytorch/vision.git
cd vision
git checkout tags/v0.16.0
conda install -c conda-forge 'ffmpeg<4.3' -y
python setup.py install
cd ../..
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/submodules/ImageBind
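To confirm that the locally built torchvision actually exposes the `video_reader` backend, a quick sanity check like the sketch below can be run (the video path is just a placeholder; point it at any local .mp4):

```python
import torchvision
from torchvision.io import VideoReader

# The video_reader backend is only available when torchvision is built from
# source against ffmpeg, as done above.
torchvision.set_video_backend("video_reader")
print(torchvision.get_video_backend())  # expected: "video_reader"

# Decode one frame to make sure the backend works end to end.
reader = VideoReader("path/to/any_video.mp4", "video")  # placeholder path
frame = next(iter(reader))
print(frame["data"].shape, frame["pts"])
```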
- ImageBind: pretrained, frozen audio encoder
- I3D: for evaluating FVD
- Stable Diffusion V1.5: pretrained image generation model
- AVID-CMA: to initialize the AVSync Classifier's encoders
- Precomputed null text encodings: for ease of computation
Please download them and structure the files as follows:
- submodules/
- ImageBind/
- pretrained/
- i3d_torchscript.pt
- stable-diffusion-v1-5/
- openai-clip-l_null_text_encoding.pt
- AVID-CMA_Audioset_InstX-N1024-PosW-N64-Top32_checkpoint.pth.tar
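As an informal check that the downloads are in place (and follow the layout above), each standalone file can be loaded directly with torch; the exact contents of the checkpoints are not documented here, so the printout is only for inspection:

```python
import torch

# Rough sanity check for the downloaded pretrained files (paths as in the tree above).
i3d = torch.jit.load("pretrained/i3d_torchscript.pt", map_location="cpu")  # TorchScript I3D used for FVD
null_text = torch.load("pretrained/openai-clip-l_null_text_encoding.pt", map_location="cpu")
avid = torch.load(
    "pretrained/AVID-CMA_Audioset_InstX-N1024-PosW-N64-Top32_checkpoint.pth.tar",
    map_location="cpu",
)

print(type(i3d))
print(getattr(null_text, "shape", type(null_text)))
print(list(avid.keys()) if isinstance(avid, dict) else type(avid))
```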
| Model | Dataset | Checkpoint | Config | Audio CFG | FVD | AlignSync |
|---|---|---|---|---|---|---|
| AVSyncD | AVSync15 | GoogleDrive | Link | 1.0 | 323.06 | 22.21 |
| | | | | 4.0 | 300.82 | 22.64 |
| | | | | 8.0 | 375.02 | 22.70 |
| | Landscapes | GoogleDrive | Link | 1.0 | 491.37 | 24.94 |
| | | | | 4.0 | 449.59 | 25.02 |
| | | | | 8.0 | 547.97 | 25.16 |
| | TheGreatestHits | GoogleDrive | Link | 1.0 | 305.41 | 22.56 |
| | | | | 4.0 | 255.49 | 22.89 |
| | | | | 8.0 | 279.12 | 23.14 |
| Model | Dataset | Checkpoint | Config | A2V Sync Acc | V2A Sync Acc |
|---|---|---|---|---|---|
| AVSync Classifier | VGGSS | GoogleDrive | Link | 40.76 | 40.86 |
Please download the checkpoints you need and structure them as follows:
- checkpoints/
- audio-cond_animation/
- avsync15_audio-cond_cfg/
- landscapes_audio-cond_cfg/
- thegreatesthits_audio-cond_cfg/
- avsync/
- vggss_sync_contrast/
The program first tries to load the audio from `audio` and the image from `image`. If they are not specified, the program loads the audio or image from `video` instead.
python -W ignore scripts/animation_demo.py --dataset AVSync15 --category "lions roaring" --audio_guidance 4.0 \
--audio ./assets/lions_roaring.wav --image ./assets/lion_and_gun.png --save_path ./assets/generation_lion_roaring.mp4
python -W ignore scripts/animation_demo.py --dataset AVSync15 --category "machine gun shooting" --audio_guidance 4.0 \
--audio ./assets/machine_gun_shooting.wav --image ./assets/lion_and_gun.png --save_path ./assets/generation_lion_shooting_gun.mp4
We have 3 metrics:
AVSync Score: the raw output value of the AVSync classifier for an input (audio, video) pair. Its range is (-∞, +∞).
python -W ignore scripts/avsync_metric.py --metric avsync_score --audio {audio path} --video {video path}
RelSync: measures the synchronization of an (audio, video) pair by using a reference. To measure synchronization for audio generation, the reference is a ground-truth audio:
python -W ignore scripts/avsync_metric.py --metric relsync --audio {generated audio path} --video {video path} --ref_audio {groundtruth audio path}
To measure synchronization for video generation, the reference is a ground-truth video:
python -W ignore scripts/avsync_metric.py --metric relsync --audio {audio path} --video {generated video path} --ref_video {groundtruth video path}
AlignSync: measures the synchronization of an (audio, video) pair by using a reference video. It is only used to measure synchronization for video generation:
python -W ignore scripts/avsync_metric.py --metric alignsync --audio {audio path} --video {generated video path} --ref_video {groundtruth video path}
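To score a whole folder of generated clips with the per-pair CLI above (the full evaluation pipeline described later is the recommended route), a simple wrapper can be used; the directory layout and file-name matching below are hypothetical and only for illustration:

```python
import subprocess
from pathlib import Path

# Hypothetical layout: generated clips, ground-truth clips, and audios share file stems.
gen_dir = Path("outputs/generated_videos")
gt_dir = Path("outputs/gt_videos")
audio_dir = Path("outputs/audios")

for gen_video in sorted(gen_dir.glob("*.mp4")):
    cmd = [
        "python", "-W", "ignore", "scripts/avsync_metric.py",
        "--metric", "alignsync",
        "--audio", str(audio_dir / f"{gen_video.stem}.wav"),
        "--video", str(gen_video),
        "--ref_video", str(gt_dir / gen_video.name),
    ]
    subprocess.run(cmd, check=True)
```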
Each dataset has 3 files/folders:
- `videos/`: the directory storing all .mp4 video files
- `train.txt`: training file names
- `test.txt`: testing file names

Optionally, we provide two precomputed files for ease of computation:
- `class_mapping.json`: maps the category string in a file name to the text string used for conditioning
- `class_clip_text_encodings_stable-diffusion-v1-5.pt`: maps the text string used for conditioning to its CLIP text encoding

Download these files from GoogleDrive and place them under the `datasets/` folder.
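To see what the two precomputed files contain, they can be inspected directly; the snippet below assumes the AVSync15 layout shown later and that both files are plain dictionaries, matching the descriptions above:

```python
import json
import torch

# Category folder name -> text string used for conditioning (assumed to be a plain dict).
with open("datasets/AVSync15/class_mapping.json") as f:
    class_mapping = json.load(f)
print(list(class_mapping.items())[:3])

# Conditioning text string -> precomputed CLIP text encoding (assumed to be a dict of tensors).
encodings = torch.load(
    "datasets/AVSync15/class_clip_text_encodings_stable-diffusion-v1-5.pt",
    map_location="cpu",
)
first_key = next(iter(encodings))
print(len(encodings), first_key, encodings[first_key].shape)
```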
To download videos:
- AVSync15: download videos from link above. (Last update: July 26 2024)
- Landscapes: download videos from MMDiffusion.
- TheGreatestHits: download videos from Visually Indicated Sounds.
- VGGSS: for AVSync Classifier training/evaluation, download videos from VGGSound. Only videos listed in `train.txt` and `test.txt` are needed.
Overall, the `datasets/` folder has the following structure:
- datasets/
- AVSync15/
- videos/
- baby_babbling_crying/
- cap_gun_shooting/
- ...
- train.txt
- test.txt
- class_mapping.json
- class_clip_text_encodings_stable-diffusion-v1-5.pt
- Landscapes/
- videos/
- train/
- explosion
- ...
- test/
- explosion
- ...
- ...
- train.txt
- test.txt
- class_mapping.json
- class_clip_text_encodings_stable-diffusion-v1-5.pt
- TheGreatestHits/
- videos/
- xxxx_denoised_thumb.mp4
- ...
- train.txt
- test.txt
- class_clip_text_encodings_stable-diffusion-v1-5.pt
- VGGSS/
- videos/
- air_conditioning_noise/
- air_horn/
- ...
- train.txt
- test.txt
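After downloading, a quick script can verify that every clip listed in the split files actually exists under `videos/`; this sketch assumes the split files list paths relative to `videos/`, one per line, as suggested by the tree above:

```python
from pathlib import Path

# Check one dataset (AVSync15 as an example) for missing video files.
root = Path("datasets/AVSync15")
for split in ("train.txt", "test.txt"):
    names = [line.strip() for line in (root / split).read_text().splitlines() if line.strip()]
    missing = [name for name in names if not (root / "videos" / name).exists()]
    print(f"{split}: {len(names)} entries, {len(missing)} missing")
```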
Training is done on 8 RTX-A4500 GPUs (20 GB) for AVSync15/Landscapes or 4 A100 GPUs for TheGreatestHits, with a total batch size of 64, using `accelerate` for distributed training and `wandb` for logging.
Checkpoints are flushed (overwritten) every `checkpointing_steps` iterations. In addition, the checkpoints at the `checkpointing_milestones`-th iteration and at the last iteration are both saved.
Please adjust these two parameters in the `.yaml` config file when customizing training recipes, so that important weights are not flushed.
PYTHONWARNINGS="ignore" accelerate launch scripts/animation_train.py --config_file configs/audio-cond_animation/{datasetname}_audio-cond_cfg.yaml
Results are saved to `exps/audio-cond_animation/{dataset}_audio-cond_cfg`, with the same structure as the pretrained checkpoints.
Evaluation is two-step:
- Generate 3 clips per video for the test set using `scripts/animation_gen.py`.
- Evaluate the generated clips against the ground-truth clips using `scripts/animation_eval.py`.

Please refer to `scripts/animation_test_{dataset}.sh` for the steps.
For example, to evaluate AVSyncD pretrained on AVSync15 with audio guidance scale = 4.0:
bash scripts/animation_test_avsync15.sh checkpoints/audio-cond_animation/avsync15_audio-cond_cfg 37000 4.0
The AVSync Classifier is trained on the VGGSS training split for 4 days on 8 RTX-A4500 GPUs with a batch size of 32.
PYTHONWARNINGS="ignore" accelerate launch scripts/avsync_train.py --config_file configs/avsync/vggss_sync_contrast.yaml
Following VGGSoundSync, we sample 31 clips from each video with a 0.04 s gap between neighboring clips. Given the audio/video clip at the center, we predict the index of its synchronized video/audio clip. A tolerance range of 5 is applied, since humans are tolerant to 0.2 s of asynchrony.
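In other words, with 31 clips the center clip has index 15, and a prediction counts as correct if it falls within ±5 indices (5 × 0.04 s = 0.2 s) of the synchronized clip. A minimal sketch of this tolerant accuracy:

```python
import torch

def sync_accuracy(pred_indices: torch.Tensor, true_indices: torch.Tensor, tolerance: int = 5) -> float:
    """Fraction of predictions within `tolerance` clip indices of the synchronized clip."""
    correct = (pred_indices - true_indices).abs() <= tolerance
    return correct.float().mean().item()

# Toy example with hypothetical predictions; ground-truth pairs are centered at index 15.
pred = torch.tensor([15, 12, 25, 16])
true = torch.full((4,), 15)
print(sync_accuracy(pred, true))  # 0.75
```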
For example, to evaluate our pretrained AVSync Classifier on 8 GPUs, run:
PYTHONWARNINGS="ignore" accelerate launch --num_processes=8 scripts/avsync_eval.py --checkpoint checkpoints/avsync/vggss_sync_contrast/ckpts/checkpoint-40000/modules --mixed_precision fp16
Please consider citing our paper if you find this repo useful:
@inproceedings{linz2024asva,
  title={Audio-Synchronized Visual Animation},
  author={Lin Zhang and Shentong Mo and Yijing Zhang and Pedro Morgado},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2024}
}