The official repo for "Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes", ECCV 2024
In this paper, we propose a pixel-level segmentation task called Referring Audio-Visual Segmentation (Ref-AVS), which requires the network to densely predict whether each pixel corresponds to the object referred to by a given multimodal-cue expression, which may include dynamic audio-visual information.
The top-left panel of Fig. 1 highlights the distinctions between Ref-AVS and previous tasks.
Fig. 2 shows the proposed baseline model for processing multimodal cues.
Run the training & evaluation:
cd Ref_AVS
sh run.sh # Update the path settings for your environment first; see /configs/config.py for details.
You can download the checkpoint here.
Core dependencies:
transformers=4.30.2
towhee=1.1.3
towhee-models=1.1.3 # Towhee is used for extracting VGGish audio feature.
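Because the versions above are pinned, it can help to confirm that the active environment matches them before running `run.sh`. The following is a minimal sketch, not part of the repo: the `PINNED` table and the `check_pins` helper are hypothetical names introduced here, and the check assumes the packages were installed under exactly these distribution names.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "transformers": "4.30.2",
    "towhee": "1.1.3",
    "towhee-models": "1.1.3",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed.
    `get_version` is injectable so the check can be exercised without
    touching the real environment (hypothetical helper, not a repo API).
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Running the script prints one line per mismatched or missing package and nothing when every pin is satisfied.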
If you find this work useful, please consider citing it:
@article{wang2024refavs,
title={Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes},
author={Wang, Yaoting and Sun, Peiwen and Zhou, Dongzhan and Li, Guangyao and Zhang, Honggang and Hu, Di},
journal={European Conference on Computer Vision (ECCV)},
year={2024},
}