This repository contains the official implementation of the following paper:
Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
Jianzong Wu*, Xiangtai Li*, Henghui Ding, Xia Li, Guangliang Cheng, Yunhai Tong, Chen Change Loy
IEEE/CVF International Conference on Computer Vision (ICCV), 2023
- 2023.7.19: Our code is publicly available.
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. Moreover, we design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
- SOTA performance: The proposed CGG achieves significant improvements on both open vocabulary instance segmentation and open-set panoptic segmentation in comparison with previous SOTA methods.
- Data/memory efficiency: Our method achieves SOTA performance without training on large-scale image-text pairs such as CC3M. We also do not use vision-language models (VLMs) like CLIP to extract language features; we only use BERT embeddings for text features. As a result, our method is more data- and memory-efficient than other SOTA methods.
- Clone Repo
git clone https://github.com/jianzongwu/betrayed-by-captions.git
cd betrayed-by-captions
- Create Conda Environment and Install Dependencies
conda create -n cgg python=3.8
conda activate cgg
# install pytorch (according to your local GPU and cuda version)
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
# build mmcv-full from source
# This repo uses mmcv-full-1.7.1
mkdir lib
cd lib
git clone git@github.com:open-mmlab/mmcv.git
cd mmcv
pip install -r requirements/optional.txt
MMCV_WITH_OPS=1 pip install -e . -v
# build mmdetection from source
# This repo uses mmdet-2.28.2
cd ..
git clone git@github.com:open-mmlab/mmdetection.git
cd mmdetection
pip install -v -e .
# build panopticapi from source
cd ..
git clone git@github.com:cocodataset/panopticapi.git
cd panopticapi
pip install -v -e .
# install other dependencies
cd ../..
pip install -r requirements.txt
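After installation, it can help to confirm that the environment actually holds the versions named in the steps above. The following is a minimal sketch, not part of the repository: the function name `check_versions` is hypothetical, and the distribution names (`mmcv-full`, `mmdet`) are assumptions about how pip registers the source builds.

```python
from importlib import metadata

# Versions named in the install steps above. The distribution names are
# assumptions about how pip registers each package.
EXPECTED = {
    "torch": "1.12.1",
    "torchvision": "0.13.1",
    "mmcv-full": "1.7.1",
    "mmdet": "2.28.2",
}


def check_versions(expected=EXPECTED, get_version=metadata.version):
    """Return a list of human-readable problems; an empty list means all match."""
    problems = []
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {wanted})")
            continue
        # Builds may carry a local suffix such as "+cu116"; ignore it.
        if found.split("+")[0] != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_versions() or ["all pinned packages match"]:
        print(line)
```

A mismatch is not necessarily fatal, since the PyTorch and CUDA versions depend on the local GPU, but it is a useful first thing to check when a build step fails.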
Before performing the following steps, please download our pretrained models first.
We release the models for open vocabulary instance segmentation (OVIS), open vocabulary object detection (OVOD), and open-set panoptic segmentation (OSPS). For the details of OSPS, please refer to this paper.
Model | 🔗 Download Links | Task |
---|---|---|
CGG-COCO-Instances | [Google Drive] [Baidu Disk] | OVIS & OVOD |
CGG-COCO-Panoptic | [Google Drive] [Baidu Disk] | OSPS |
Then, place the models in the checkpoints directory.
The directory structure will be arranged as:
checkpoints
|- README.md
|- coco_instance_ag3x_1x.pth
|- coco_panoptic_p20.pth
We provide a Jupyter notebook for running inference with our model on both OVIS and OSPS. Feel free to upload your own images to test the model in various scenarios!
Dataset | COCO | ADE20K |
---|---|---|
Details | For training and evaluation | For evaluation |
Download | Official Link | ADEChallengeData2016 |
For the COCO dataset, we use the 2017 version images and annotations. Please download the train2017 and val2017 images. For annotations, we use captions_train2017.json, instances_train/val2017.json, and panoptic_train/val2017.json.
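For orientation, the caption file is the part of the annotations that supplies the text supervision. The sketch below groups captions by image, assuming the standard COCO captions layout in which the top-level `annotations` list holds entries with `image_id` and `caption` fields. The function name `captions_by_image` is hypothetical and is not taken from this repository.

```python
import json
from collections import defaultdict


def captions_by_image(caption_json_path):
    """Map each image id to the list of its caption strings.

    Assumes the standard COCO captions layout: a top-level "annotations"
    list whose entries carry "image_id" and "caption".
    """
    with open(caption_json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    grouped = defaultdict(list)
    for ann in data["annotations"]:
        grouped[ann["image_id"]].append(ann["caption"].strip())
    return dict(grouped)
```

For example, `captions_by_image("data/coco/annotations/captions_train2017.json")` would return one list of captions per training image once the dataset is in place.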
For evaluation on the ADE20K dataset, we use the MIT Scene Parsing Benchmark validation set, which contains 100 classes. Please download the converted COCO-format annotation file from here and put it in the annotations folder.
Please put all the datasets in the data directory. The data directory structure will be arranged as:
data
|- ade20k
   |- ADEChallengeData2016
      |- annotations
         |- train
         |- validation
         |- ade20k_instances_val.json
      |- images
         |- train
         |- validation
      |- objectsInfo150.txt
      |- sceneCategories.txt
|- coco
   |- annotations
      |- captions_train2017.json
      |- instances_train2017.json
      |- instances_val2017.json
      |- panoptic_train2017.json
      |- panoptic_val2017.json
   |- train2017
   |- val2017
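Before launching evaluation or training, a quick layout check can catch a misplaced folder early. This is a minimal sketch under stated assumptions: the relative paths are read off the directory listing above, with the nesting of the ADE20K entries inferred from it, and the name `missing_paths` is hypothetical.

```python
from pathlib import Path

# Relative paths read off the directory listing above. The nesting of the
# ADE20K entries is an assumption inferred from that listing.
EXPECTED_PATHS = [
    "coco/annotations/captions_train2017.json",
    "coco/annotations/instances_train2017.json",
    "coco/annotations/instances_val2017.json",
    "coco/annotations/panoptic_train2017.json",
    "coco/annotations/panoptic_val2017.json",
    "coco/train2017",
    "coco/val2017",
    "ade20k/ADEChallengeData2016/annotations/ade20k_instances_val.json",
    "ade20k/ADEChallengeData2016/images/validation",
]


def missing_paths(root="data", expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_paths():
        print(f"missing: data/{rel}")
```

The same function works for other folders by passing a different `root` and `expected` list, for instance the checkpoint file names.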
We provide evaluation code for the COCO dataset.
Run the following commands for evaluation on OVIS.
Run on single GPU:
python tools/test.py \
configs/instance/coco_b48n17.py \
checkpoints/coco_instance_ag3x_1x.pth \
--eval bbox segm
Run on multiple GPUs:
bash ./tools/dist_test.sh \
configs/instance/coco_b48n17.py \
checkpoints/coco_instance_ag3x_1x.pth \
8 \
--eval bbox segm
Run the following commands for evaluation on OSPS.
Run on single GPU:
python tools/test.py \
configs/openset_panoptic/coco_panoptic_p20.py \
checkpoints/coco_panoptic_p20.pth \
--eval bbox segm
Run on multiple GPUs:
bash ./tools/dist_test.sh \
configs/openset_panoptic/coco_panoptic_p20.py \
checkpoints/coco_panoptic_p20.pth \
8 \
--eval bbox segm
You should obtain the scores reported in the paper. The output will also be saved in work_dirs/{config_name}.
Our model is first pre-trained in a class-agnostic manner. The pre-training configs are provided in configs/instance/coco_ag_pretrain_3x (for OVIS) and configs/openset_panoptic/p{5/10/20}_ag_pretrain (for OSPS).
Run the following commands for class-agnostic pre-training.
Run on single GPU:
# OVIS
python tools/train.py \
configs/instance/coco_ag_pretrain_3x.py
# OSPS
python tools/train.py \
configs/openset_panoptic/p20_ag_pretrain.py
Run on multiple GPUs:
# OVIS
bash ./tools/dist_train.sh \
configs/instance/coco_ag_pretrain_3x.py \
8
# OSPS
bash ./tools/dist_train.sh \
configs/openset_panoptic/p20_ag_pretrain.py \
8
The pre-training on OVIS takes 36 epochs and can take a long time. Here we provide downloads for the class-agnostic pre-trained models.
Model | 🔗 Download Links | Task |
---|---|---|
CGG-instance-pretrain | [Google Drive] [Baidu Disk] | OVIS & OVOD |
CGG-panoptic-pretrain | [Google Drive] [Baidu Disk] | OSPS |
The directory structure will be arranged as:
pretrained
|- README.md
|- class_ag_pretrained_3x.pth
|- panoptic_p20_ag_pretrain.pth
If you run the class-agnostic pre-training yourself, please rename the pre-trained model saved in work_dirs and place it in the pretrained folder, following the directory structure above. The training configs will load the pre-trained weights.
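The rename-and-move step can be scripted. The following is a minimal sketch: the helper name `stage_pretrained` is hypothetical, and the source path in the usage note is an assumption about where the training run writes its final checkpoint.

```python
import shutil
from pathlib import Path


def stage_pretrained(src, dst):
    """Copy a checkpoint from a training run to the name the configs expect."""
    src, dst = Path(src), Path(dst)
    if not src.is_file():
        raise FileNotFoundError(f"checkpoint not found: {src}")
    # Create the target folder (e.g. pretrained/) if it does not exist yet.
    dst.parent.mkdir(parents=True, exist_ok=True)
    # copy2 follows symlinks, so a link to the newest checkpoint is
    # copied as a regular file.
    shutil.copy2(src, dst)
    return dst
```

Usage, assuming the OVIS pre-training run left its last checkpoint at `work_dirs/coco_ag_pretrain_3x/latest.pth` (an assumed path; adjust it to the actual file): `stage_pretrained("work_dirs/coco_ag_pretrain_3x/latest.pth", "pretrained/class_ag_pretrained_3x.pth")`. Copying rather than moving keeps the original run directory intact.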
After pre-training, the open vocabulary training configs are provided in configs/instance/coco_b48n17 (for OVIS) and configs/openset_panoptic/coco_panoptic_p{5/10/20} (for OSPS).
Run one of the following commands for training.
Run on single GPU:
# OVIS
python tools/train.py \
configs/instance/coco_b48n17.py
# OSPS
python tools/train.py \
configs/openset_panoptic/coco_panoptic_p20.py
Run on multiple GPUs:
# OVIS
bash ./tools/dist_train.sh \
configs/instance/coco_b48n17.py \
8
# OSPS
bash ./tools/dist_train.sh \
configs/openset_panoptic/coco_panoptic_p20.py \
8
The output will be saved in work_dirs/{config_name}.
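To locate the outputs programmatically, the `work_dirs/{config_name}` pattern can be computed from the config path. This sketch assumes the MMDetection 2.x default, where the training script uses the config file name without its extension when no work directory is set explicitly. The function name `default_work_dir` is hypothetical.

```python
from pathlib import Path


def default_work_dir(config_path, root="work_dirs"):
    """Return root/<config file name without extension>.

    Assumes no explicit work directory was passed on the command line
    or set inside the config itself.
    """
    return Path(root) / Path(config_path).stem


if __name__ == "__main__":
    print(default_work_dir("configs/instance/coco_b48n17.py"))
```

Under that assumption, the OVIS and OSPS training commands above write to `work_dirs/coco_b48n17` and `work_dirs/coco_panoptic_p20` respectively.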
Results on OVIS:
If you find our repo useful for your research, please consider citing our paper:
@article{wu2023betrayed,
title={Betrayed by captions: Joint caption grounding and generation for open vocabulary instance segmentation},
author={Wu, Jianzong and Li, Xiangtai and Ding, Henghui and Li, Xia and Cheng, Guangliang and Tong, Yunhai and Loy, Chen Change},
journal={arXiv preprint arXiv:2301.00805},
year={2023}
}
If you have any questions, please feel free to contact us via jzwu@stu.pku.edu.cn or xiangtai.li@ntu.edu.sg.
Licensed under the Creative Commons Attribution-NonCommercial 4.0 International license, for non-commercial use only. Any commercial use requires formal permission first.
This repository is maintained by Jianzong Wu and Xiangtai Li.
This code is based on MMDetection.