(ICCV 2023) Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation

CGG (ICCV 2023)

This repository contains the official implementation of the following paper:

Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation
Jianzong Wu*, Xiangtai Li*, Henghui Ding, Xia Li, Guangliang Cheng, Yunhai Tong, Chen Change Loy
IEEE/CVF International Conference on Computer Vision (ICCV), 2023

[Paper] [Project]

⭐ News

  • 2023.7.19: Our code is publicly available.

Short Introduction

In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. Moreover, we design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.

[Figure: teaser]

Demo

[Figure: demo]

Overview

[Figure: overall structure]

🚀 Highlights:

  • SOTA performance: The proposed CGG achieves significant improvements on both open vocabulary instance segmentation and open-set panoptic segmentation compared with previous SOTA methods.
  • Data/memory efficiency: Our method achieves SOTA performance without training on large-scale image-text pairs such as CC3M. We also do not use vision-language models (VLMs) like CLIP to extract language features; we use only BERT embeddings for text features. As a result, our method is more data- and memory-efficient than other SOTA methods.

Dependencies and Installation

  1. Clone Repo

    git clone https://github.com/jianzongwu/betrayed-by-captions.git
    cd betrayed-by-captions
  2. Create Conda Environment and Install Dependencies

     conda create -n cgg python=3.8
     conda activate cgg
    
     # install pytorch (according to your local GPU and cuda version)
     conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
    
     # build mmcv-full from source
     # This repo uses mmcv-full-1.7.1
     mkdir lib
     cd lib
     git clone git@github.com:open-mmlab/mmcv.git
     cd mmcv
     git checkout v1.7.1  # match the mmcv-full version noted above
     pip install -r requirements/optional.txt
     MMCV_WITH_OPS=1 pip install -e . -v
    
     # build mmdetection from source
     # This repo uses mmdet-2.28.2
     cd ..
     git clone git@github.com:open-mmlab/mmdetection.git
     cd mmdetection
     git checkout v2.28.2  # match the mmdet version noted above
     pip install -v -e .
    
     # build panopticapi from source
     cd ..
     git clone git@github.com:cocodataset/panopticapi.git
     cd panopticapi
     pip install -v -e .
    
     # install other dependencies
     cd ../..
     pip install -r requirements.txt
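
After installation, it can help to confirm that the environment matches the versions named in the steps above. The following is a minimal sketch, not part of this repository: the `EXPECTED` table and both helper names are hypothetical, and the distribution names (`torch`, `mmcv-full`, `mmdet`) are assumptions about how the packages register themselves with pip.

```python
from importlib import metadata

# Versions taken from the installation steps above. The distribution names
# are assumptions about how each package is registered in the environment.
EXPECTED = {
    "torch": "1.12.1",
    "torchvision": "0.13.1",
    "mmcv-full": "1.7.1",
    "mmdet": "2.28.2",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED):
    """Map each package name to a tuple (expected, found, ok)."""
    report = {}
    for name, want in expected.items():
        found = installed_version(name)
        # Local builds may carry a suffix such as "+cu116", so compare prefixes.
        ok = found is not None and found.startswith(want)
        report[name] = (want, found, ok)
    return report


if __name__ == "__main__":
    for name, (want, found, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name:12s} expected {want:8s} found {found}  [{status}]")
```

A mismatch does not necessarily mean the code will fail, but it is the first thing to check if building or importing mmcv or mmdet raises errors.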
    

Get Started

Prepare pretrained models

Before performing the following steps, please download our pretrained models first.

We release the models for open vocabulary instance segmentation (OVIS), open vocabulary object detection (OVOD), and open-set panoptic segmentation (OSPS). For the details of OSPS, please refer to this paper.

Model | 🔗 Download Links | Task
CGG-COCO-Instances | [Google Drive] [Baidu Disk] | OVIS & OVOD
CGG-COCO-Panoptic | [Google Drive] [Baidu Disk] | OSPS

Then, place the models in the checkpoints directory.

The directory structure will be arranged as:

checkpoints
   |- README.md
   |- coco_instance_ag3x_1x.pth
   |- coco_panoptic_p20.pth

Quick inference

We provide a Jupyter notebook for running inference with our model on both OVIS and OSPS. Feel free to upload your own images to test the model in various scenarios!

Prepare datasets

Dataset | COCO | ADE20K
Details | For training and evaluation | For evaluation
Download | Official Link | ADEChallengeData2016

For the COCO dataset, we use the 2017 version images and annotations. Please download the train2017 and val2017 images. For annotations, we use captions_train2017.json, instances_train/val2017.json, and panoptic_train/val2017.json.
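
As a quick sanity check on the caption annotations, the sketch below groups captions by image. It relies only on the standard COCO captions layout (a top-level `annotations` list whose entries carry `image_id` and `caption`); the function name `load_captions_by_image` is hypothetical and is not part of this repository.

```python
import json
from collections import defaultdict


def load_captions_by_image(annotation_file):
    """Group COCO captions by image id.

    Returns a dict mapping image_id to a list of caption strings,
    in the order they appear in the annotation file.
    """
    with open(annotation_file, "r", encoding="utf-8") as f:
        data = json.load(f)

    captions = defaultdict(list)
    for ann in data["annotations"]:
        # Strip surrounding whitespace; captions often end with a stray space.
        captions[ann["image_id"]].append(ann["caption"].strip())
    return dict(captions)


if __name__ == "__main__":
    # Path follows the data directory layout described in this README.
    by_image = load_captions_by_image("data/coco/annotations/captions_train2017.json")
    print(f"{len(by_image)} images with captions")
```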

For evaluation on the ADE20K dataset, we use the MIT Scene Parsing Benchmark validation set, which contains 100 classes. Please download the converted COCO-format annotation file from here and put it in the annotations folder.

Please put all the datasets in the data directory. The data directory structure will be arranged as:

data
    |- ade20k
        |- ADEChallengeData2016
            |- annotations
                |- train
                |- validation
                |- ade20k_instances_val.json
            |- images
                |- train
                |- validation
            |- objectsInfo150.txt
            |- sceneCategories.txt
    |- coco
        |- annotations
            |- captions_train2017.json
            |- instances_train2017.json
            |- instances_val2017.json
            |- panoptic_train2017.json
            |- panoptic_val2017.json
        |- train2017
        |- val2017
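
Before launching training or evaluation, the layout above can be checked with a short script. This is a minimal sketch: the list `REQUIRED_PATHS` is copied from the tree above (restricted to the entries used for COCO training and ADE20K evaluation), and the helper name `missing_paths` is hypothetical.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
REQUIRED_PATHS = [
    "coco/annotations/captions_train2017.json",
    "coco/annotations/instances_train2017.json",
    "coco/annotations/instances_val2017.json",
    "coco/annotations/panoptic_train2017.json",
    "coco/annotations/panoptic_val2017.json",
    "coco/train2017",
    "coco/val2017",
    "ade20k/ADEChallengeData2016/annotations/ade20k_instances_val.json",
    "ade20k/ADEChallengeData2016/images/validation",
]


def missing_paths(data_root, required=REQUIRED_PATHS):
    """Return the entries of `required` that do not exist under `data_root`."""
    root = Path(data_root)
    return [rel for rel in required if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths("data")
    if missing:
        print("Missing dataset files or folders:")
        for rel in missing:
            print(f"  data/{rel}")
    else:
        print("Dataset layout looks complete.")
```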

Evaluation

We provide evaluation code for the COCO dataset.

Run the following commands for evaluation on OVIS.

Run on single GPU:

python tools/test.py \
    configs/instance/coco_b48n17.py \
    checkpoints/coco_instance_ag3x_1x.pth \
    --eval bbox segm

Run on multiple GPUs:

bash ./tools/dist_test.sh \
    configs/instance/coco_b48n17.py \
    checkpoints/coco_instance_ag3x_1x.pth \
    8 \
    --eval bbox segm

Run the following commands for evaluation on OSPS.

Run on single GPU:

python tools/test.py \
    configs/openset_panoptic/coco_panoptic_p20.py \
    checkpoints/coco_panoptic_p20.pth \
    --eval bbox segm

Run on multiple GPUs:

bash ./tools/dist_test.sh \
    configs/openset_panoptic/coco_panoptic_p20.py \
    checkpoints/coco_panoptic_p20.pth \
    8 \
    --eval bbox segm

You should get the scores reported in the paper. The output will also be saved in work_dirs/{config_name}.
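
The four evaluation commands above differ only in the config, the checkpoint, and the number of GPUs. If you script evaluations, a small wrapper along these lines may be convenient. It is a sketch only: `build_eval_command` is a hypothetical helper, not part of this repository, and it simply reproduces the command lines shown above.

```python
import subprocess


def build_eval_command(config, checkpoint, num_gpus=1, metrics=("bbox", "segm")):
    """Build the evaluation command as an argument list.

    One GPU uses tools/test.py directly; more than one uses
    tools/dist_test.sh with the GPU count as its third argument.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        cmd = ["python", "tools/test.py", config, checkpoint]
    else:
        cmd = ["bash", "./tools/dist_test.sh", config, checkpoint, str(num_gpus)]
    return cmd + ["--eval", *metrics]


if __name__ == "__main__":
    # OVIS evaluation on a single GPU, run from the repository root.
    cmd = build_eval_command(
        "configs/instance/coco_b48n17.py",
        "checkpoints/coco_instance_ag3x_1x.pth",
    )
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)
```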

Training

Our model is first pre-trained in a class-agnostic manner. The pre-training configs are provided in configs/instance/coco_ag_pretrain_3x (for OVIS) and configs/openset_panoptic/p{5/10/20}_ag_pretrain (for OSPS).

Run the following commands for class-agnostic pre-training.

Run on single GPU:

# OVIS
python tools/train.py \
    configs/instance/coco_ag_pretrain_3x.py
# OSPS
python tools/train.py \
    configs/openset_panoptic/p20_ag_pretrain.py

Run on multiple GPUs:

# OVIS
bash ./tools/dist_train.sh \
    configs/instance/coco_ag_pretrain_3x.py \
    8
# OSPS
bash ./tools/dist_train.sh \
    configs/openset_panoptic/p20_ag_pretrain.py \
    8

Pre-training for OVIS takes 36 epochs and can take a long time, so we provide the class-agnostic pre-trained models for download.

Model | 🔗 Download Links | Task
CGG-instance-pretrain | [Google Drive] [Baidu Disk] | OVIS & OVOD
CGG-panoptic-pretrain | [Google Drive] [Baidu Disk] | OSPS

The directory structure will be arranged as:

pretrained
   |- README.md
   |- class_ag_pretrained_3x.pth
   |- panoptic_p20_ag_pretrain.pth

If you perform the class-agnostic pre-training yourself, please rename the pre-trained model saved in work_dirs and place it in the pretrained folder, following the directory structure above. The training configs will load these pre-trained weights.
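
The rename-and-copy step can be scripted as in the sketch below. The helper `export_pretrained` is hypothetical, and the source file name `latest.pth` is an assumption about what the training run writes into work_dirs/{config_name}; pass `source_name` explicitly if your run produced a differently named checkpoint.

```python
import shutil
from pathlib import Path


def export_pretrained(work_dir, target_name, pretrained_dir="pretrained",
                      source_name="latest.pth"):
    """Copy a checkpoint from a work dir into the pretrained folder.

    `source_name` defaults to "latest.pth", which is an assumption about
    the checkpoint name produced by training. Returns the destination path.
    """
    src = Path(work_dir) / source_name
    if not src.is_file():
        raise FileNotFoundError(f"No checkpoint found at {src}")

    dst_dir = Path(pretrained_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    dst = dst_dir / target_name
    # copyfile follows symlinks, so a linked checkpoint is copied by content.
    shutil.copyfile(src, dst)
    return dst


if __name__ == "__main__":
    # Target names follow the pretrained directory structure shown above.
    export_pretrained("work_dirs/coco_ag_pretrain_3x", "class_ag_pretrained_3x.pth")
    export_pretrained("work_dirs/p20_ag_pretrain", "panoptic_p20_ag_pretrain.pth")
```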

After pre-training, run the open vocabulary training. The configs are provided in configs/instance/coco_b48n17 (for OVIS) and configs/openset_panoptic/coco_panoptic_p{5/10/20} (for OSPS).

Run one of the following commands for training.

Run on single GPU:

# OVIS
python tools/train.py \
    configs/instance/coco_b48n17.py
# OSPS
python tools/train.py \
    configs/openset_panoptic/coco_panoptic_p20.py

Run on multiple GPUs:

# OVIS
bash ./tools/dist_train.sh \
    configs/instance/coco_b48n17.py \
    8
# OSPS
bash ./tools/dist_train.sh \
    configs/openset_panoptic/coco_panoptic_p20.py \
    8

The output will be saved in work_dirs/{config_name}.

Results

Quantitative results

Results on OVIS:

[Figure: result-OVIS]

Results on OVOD:

[Figure: result-OVOD]

Results on OSPS:

[Figure: result-OSPS]

Citation

If you find our repo useful for your research, please consider citing our paper:

@article{wu2023betrayed,
   title={Betrayed by captions: Joint caption grounding and generation for open vocabulary instance segmentation},
   author={Wu, Jianzong and Li, Xiangtai and Ding, Henghui and Li, Xia and Cheng, Guangliang and Tong, Yunhai and Loy, Chen Change},
   journal={arXiv preprint arXiv:2301.00805},
   year={2023}
 }

Contact

If you have any questions, please feel free to contact us via jzwu@stu.pku.edu.cn or xiangtai.li@ntu.edu.sg.

License

Licensed under a Creative Commons Attribution-NonCommercial 4.0 International license, for non-commercial use only. Any commercial use requires formal permission first.

Acknowledgement

This repository is maintained by Jianzong Wu and Xiangtai Li.

This code is based on MMDetection.