This is the official implementation for our CVPR 2024 paper "Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation".
Please note that experimental results may vary slightly with different environments and settings. In all experiments on PASCAL-5{i} and COCO-20{i}, input images are resized to 473×473. On COCO-20{i}, a higher input resolution yields higher performance. In addition, PI-CLIP is trained for only 30 epochs on both PASCAL-5{i} and COCO-20{i}, so the model can perform better if trained for more epochs. Nevertheless, it is reasonable to compare your results with those reported in the paper.
Abstract: Few-shot segmentation remains challenging due to the limited labeling information available for unseen classes. Most previous approaches rely on extracting high-level feature maps from the frozen visual encoder to compute the pixel-wise similarity as a key prior guidance for the decoder. However, such a prior representation suffers from coarse granularity and poor generalization to new classes, since these high-level feature maps have an obvious category bias. In this work, we propose to replace the visual prior representation with the visual-text alignment capacity to capture more reliable guidance and enhance the model generalization. Specifically, we design two kinds of training-free prior information generation strategies that utilize the semantic alignment capability of the Contrastive Language-Image Pre-training model (CLIP) to locate the target class. Moreover, to acquire more accurate prior guidance, we build a high-order relationship of attention maps and utilize it to refine the initial prior information. Experiments on both the PASCAL-5{i} and COCO-20{i} datasets show that our method obtains a substantial improvement and reaches new state-of-the-art performance.
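To make the idea concrete, the snippet below is a minimal, illustrative sketch (not the exact PI-CLIP implementation) of how a training-free prior map can be derived from CLIP-style features: the cosine similarity between L2-normalized patch embeddings and a class text embedding is reshaped into a spatial map and min-max normalized. The function name and the dummy tensors are placeholders standing in for real CLIP outputs.

import torch

def clip_prior_map(patch_feats: torch.Tensor, text_feat: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Cosine similarity between patch features (N, C) and a text embedding (C,),
    reshaped to (h, w) and min-max normalized to [0, 1]."""
    patch_feats = patch_feats / patch_feats.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm()
    sim = patch_feats @ text_feat                     # (N,)
    prior = sim.reshape(h, w)
    prior = (prior - prior.min()) / (prior.max() - prior.min() + 1e-6)
    return prior

# dummy stand-ins for CLIP visual patch tokens and a text embedding
patch_feats = torch.randn(30 * 30, 512)               # e.g. a 30x30 grid of 512-d patch features
text_feat = torch.randn(512)
print(clip_prior_map(patch_feats, text_feat, 30, 30).shape)  # torch.Size([30, 30])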
- python == 3.10.4
- torch == 1.12.1
- torchvision == 0.13.1
- cuda == 11.6
- mmcv-full == 1.7.1
- mmsegmentation == 0.30.0
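As an optional sanity check that the key packages import correctly under the versions listed above:

import torch, torchvision, mmcv, mmseg
print(torch.__version__, torchvision.__version__, mmcv.__version__, mmseg.__version__)
print("CUDA available:", torch.cuda.is_available())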
cd PI-CLIP
git clone https://github.com/lucasb-eyer/pydensecrf
cd pydensecrf
python setup.py install
# install other packages
cd ..
python env.py
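If pydensecrf is built correctly, the following minimal check should run without errors (DenseCRF2D takes width, height, and the number of labels):

import pydensecrf.densecrf as dcrf
d = dcrf.DenseCRF2D(473, 473, 2)   # width, height, number of labels
print("pydensecrf OK")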
Please download the following datasets and put them into the ../data directory:
- PASCAL-5{i}: PASCAL VOC 2012 and SBD
- COCO-20{i}: COCO 2014
The data lists follow PFENet; you can download them directly and put them into the ./lists directory.
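For reference, one possible layout after downloading is sketched below (directory names are illustrative; the exact structure expected by the data lists follows PFENet/BAM):

../data/
    VOCdevkit2012/      # PASCAL VOC 2012 images and SBD augmented annotations
    MSCOCO2014/         # COCO 2014 images and annotations
./lists/
    pascal/             # downloaded PFENet-style data lists
    coco/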
Before running the code, generate the annotations for base classes by running util/get_mulway_base_data.py; more details are available at BAM.
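For example (a plain invocation is assumed here; the script's exact arguments, if any, follow BAM):

python util/get_mulway_base_data.py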
We have adopted the same procedures as BAM and HDMNet for the pre-trained backbones, placing them in the ../initmodel directory.
Download the CLIP pre-trained ViT-B/16 model here and put it into ../initmodel/clip.
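As a quick check that the checkpoint is picked up from the local directory (this assumes the OpenAI clip package is installed and the ViT-B/16 checkpoint sits in ../initmodel/clip; it is not part of the official scripts):

import clip
model, preprocess = clip.load("ViT-B/16", device="cpu", download_root="../initmodel/clip")
print("CLIP ViT-B/16 loaded")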
- First, update the configurations in ./config for training or testing.
- Train script
sh train.sh [exp_name] [dataset] [GPUs]
# Example (split0 | PASCAL VOC2012 | 2 GPUs for training):
# sh train.sh split0 pascal 2
- Test script
sh test.sh [exp_name] [dataset] [GPUs]
# Example (split0 | COCO dataset | 1 GPU for testing):
# sh test.sh split0 coco 1
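To run all four folds in sequence, a simple shell loop can be used (a convenience example, not part of the provided scripts):

# e.g. train every PASCAL-5{i} fold with 2 GPUs
for split in split0 split1 split2 split3; do sh train.sh $split pascal 2; done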
This repository owes its existence to the exceptional contributions of other projects:
- PFENet: https://github.com/dvlab-research/PFENet
- BAM: https://github.com/chunbolang/BAM
- HDMNet: https://github.com/Pbihao/HDMNet
Many thanks for their excellent work.
If you have any questions, feel free to email me at wangjin@s.upc.edu.cn.
If you find our work and this repository useful, please consider giving a star and a citation.
@inproceedings{wang2024rethinking,
title={Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation},
author={Wang, Jin and Zhang, Bingfeng and Pang, Jian and Chen, Honglong and Liu, Weifeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={3941--3951},
year={2024}
}