Ruohao Guo, Liao Qu, Dantong Niu, Yanyu Qi, Wenzhen Yue, Ji Shi, Bowei Xing, Xianghua Ying*
25.07.2024: Our paper has been accepted to ACM MM 2024 as an Oral!
Audio-visual semantic segmentation (AVSS) aims to segment and classify sounding objects in videos with acoustic cues. However, most approaches operate on the closed-set assumption and only identify pre-defined categories from training data, lacking the generalization ability to detect novel categories in practical applications. In this paper, we introduce a new task: open-vocabulary audio-visual semantic segmentation, extending the AVSS task to open-world scenarios beyond the annotated label space. This is a more challenging task that requires recognizing all categories, even those that have never been seen or heard during training. Moreover, we propose the first open-vocabulary AVSS framework, OV-AVSS, which mainly consists of two parts: 1) a universal sound source localization module to perform audio-visual fusion and locate all potential sounding objects and 2) an open-vocabulary classification module to predict categories with the help of the prior knowledge from large-scale pre-trained vision-language models. To properly evaluate open-vocabulary AVSS, we split zero-shot training and testing subsets based on the AVSBench-semantic benchmark, namely AVSBench-OV. Extensive experiments demonstrate the strong segmentation and zero-shot generalization ability of our model on all categories. On the AVSBench-OV dataset, OV-AVSS achieves 55.43% mIoU on base categories and 29.14% mIoU on novel categories, exceeding the state-of-the-art zero-shot method by 41.88%/20.61% and the open-vocabulary method by 10.2%/11.6%.
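The base/novel mIoU figures above are means of per-class intersection-over-union taken over two disjoint category sets. The following is a minimal sketch of that metric in plain Python, assuming flat lists of per-pixel labels; the class ids, the base/novel split, and the helper names are hypothetical and are not taken from this repository's evaluation code, which may accumulate statistics differently.

```python
from collections import defaultdict


def per_class_iou(pred, gt, class_ids):
    """Return {class_id: IoU} for two equally long flat label sequences."""
    inter = defaultdict(int)
    union = defaultdict(int)
    for p, g in zip(pred, gt):
        if p == g:
            if p in class_ids:
                inter[p] += 1
                union[p] += 1
        else:
            # A mismatch adds one pixel to the union of both classes involved.
            if p in class_ids:
                union[p] += 1
            if g in class_ids:
                union[g] += 1
    # Classes that never occur in pred or gt are skipped.
    return {c: inter[c] / union[c] for c in class_ids if union[c] > 0}


def mean_iou(ious, subset):
    """Average IoU over the classes of `subset` that have a defined IoU."""
    values = [ious[c] for c in subset if c in ious]
    return sum(values) / len(values) if values else float("nan")


# Toy example: label 0 is background, classes 1 and 2 are "base", 3 is "novel".
pred = [1, 1, 2, 2, 3, 0]
gt = [1, 2, 2, 2, 3, 3]
ious = per_class_iou(pred, gt, class_ids={1, 2, 3})
base_miou = mean_iou(ious, subset={1, 2})
novel_miou = mean_iou(ious, subset={3})
print(f"base mIoU: {base_miou:.4f}, novel mIoU: {novel_miou:.4f}")
```

Reporting the two subsets separately shows how much accuracy is retained on categories absent from the training labels.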
Example 1:
conda create -n ov_avss python==3.8 -y
conda activate ov_avss
git clone https://github.com/ruohaoguo/ovavss
cd ovavss
pip install torch torchvision
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
pip install -e third_parties/mask_adapted_clip
cd ov_avss/modeling/pixel_decoder/ops
bash make.sh
pip install git+https://github.com/sennnnn/TrackEval.git
Example 2:
conda create -n ov_avss python==3.8 -y
conda activate ov_avss
git clone https://github.com/ruohaoguo/ovavss
cd ovavss
conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
pip install -U opencv-python
git clone https://github.com/facebookresearch/detectron2
cd detectron2
pip install -e .
cd ..
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
pip install -e third_parties/mask_adapted_clip
cd ov_avss/modeling/pixel_decoder/ops
bash make.sh
pip install git+https://github.com/sennnnn/TrackEval.git
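After installation, a quick import check can catch a missing package before training starts. This is a minimal sketch using only the standard library; the list of module names is an assumption based on the packages installed above (`cv2` is the import name of `opencv-python`, `clip` that of the OpenAI CLIP package), and it does not verify CUDA or the compiled pixel-decoder ops.

```python
import importlib.util

# Assumed import names for the packages installed by the commands above.
REQUIRED = ["torch", "torchvision", "cv2", "detectron2", "clip"]


def missing_modules(names):
    """Return the subset of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            if importlib.util.find_spec(name) is None:
                missing.append(name)
        except (ImportError, ValueError):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it inside the activated `ov_avss` environment.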
- Download the pretrained weight (model_final_3c8ec9.pkl, [facebook]) and put it in ./pre_models.
- Download the pretrained weight (model_final_83d103.pkl, [facebook]) and put it in ./pre_models.
- Download the pretrained weight (vggish-10086976.pth, [baidu(code: 1234) | OneDrive]) and put it in ./pre_models.
- Download and unzip the datasets [baidu(code: 1234) | OneDrive] and put them in ./datasets.
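The downloads listed above can be verified with a short script before training. This is a minimal sketch, assuming it is run from the repository root; the helper name `missing_files` is hypothetical, and the script only checks that the files and the `datasets` directory exist, not their contents or the layout inside `datasets`.

```python
from pathlib import Path

# File names as given in the download list above.
EXPECTED_WEIGHTS = [
    "model_final_3c8ec9.pkl",
    "model_final_83d103.pkl",
    "vggish-10086976.pth",
]


def missing_files(root="."):
    """Return relative paths of expected weights/directories absent under `root`."""
    root = Path(root)
    missing = [
        (Path("pre_models") / name).as_posix()
        for name in EXPECTED_WEIGHTS
        if not (root / "pre_models" / name).is_file()
    ]
    if not (root / "datasets").is_dir():
        missing.append("datasets")
    return missing


if __name__ == "__main__":
    missing = missing_files()
    print("Missing:", missing if missing else "nothing")
```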
- For the ResNet-50 backbone, run the following command:

      python train_net.py \
        --config-file configs/avsbench/OV_AVSS_R50.yaml \
        --num-gpus 1

- For the Swin-base backbone, run the following command:

      python train_net.py \
        --config-file configs/avsbench/swin/OV_AVSS_SwinB.yaml \
        --num-gpus 1
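When switching between backbones often, the two training commands can be assembled from a small lookup table. This is a minimal sketch: the config paths and the `--config-file` / `--num-gpus` flags are the ones shown above, while the `CONFIGS` keys and the `build_train_command` helper are hypothetical conveniences, not part of the repository.

```python
import sys

# Config paths as given in the training commands above; the keys are illustrative.
CONFIGS = {
    "r50": "configs/avsbench/OV_AVSS_R50.yaml",
    "swinb": "configs/avsbench/swin/OV_AVSS_SwinB.yaml",
}


def build_train_command(backbone, num_gpus=1):
    """Return the argument list for launching train_net.py with a given backbone."""
    if backbone not in CONFIGS:
        raise ValueError(f"unknown backbone {backbone!r}; choose from {sorted(CONFIGS)}")
    return [
        sys.executable,
        "train_net.py",
        "--config-file",
        CONFIGS[backbone],
        "--num-gpus",
        str(num_gpus),
    ]


if __name__ == "__main__":
    # Only prints the command; nothing is launched here.
    print(" ".join(build_train_command("r50")))
```

To launch training from the repository root, the returned list can be passed to `subprocess.run(build_train_command("swinb"), check=True)`.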
- For the ResNet-50 backbone, run the following commands:
  - Download the trained model (model_ov_avss_r50.pth, [baidu(code: 1234) | OneDrive]) and put it in ./pre_models.

        cd demo_video
        python test_net_video_avsbench_r50.py

- For the Swin-base backbone, run the following commands:
  - Download the trained model (model_ov_avss_swinb.pth, [baidu(code: 1234) | OneDrive]) and put it in ./pre_models.

        cd demo_video
        python test_net_video_avsbench_swinb.py
- Note: Results tested on Example 1 and Example 2 differ slightly in performance.
If you would like to improve the usability of this project or have any advice, please feel free to contact us directly (ruohguo@foxmail.com).
Please consider citing our paper in your publications if the project helps your research. The BibTeX reference is as follows.
@article{guo2024open,
title={Open-Vocabulary Audio-Visual Semantic Segmentation},
author={Guo, Ruohao and Qu, Liao and Niu, Dantong and Qi, Yanyu and Yue, Wenzhen and Shi, Ji and Xing, Bowei and Ying, Xianghua},
journal={arXiv preprint arXiv:2407.21721},
year={2024}
}
This repo is based on OpenVIS, Mask2Former, and detectron2. Thanks for their wonderful work.