Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models

This repo contains the official PyTorch implementation of the BMVC 2022 oral paper Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models by Chaofan Ma*, Yuhuan Yang*, Yanfeng Wang, Ya Zhang, and Weidi Xie.

For more information, check out the project page and the paper on arXiv.

Requirements

python==3.7.11
torch==1.9.0
torchvision==0.10.0
clip (from https://github.com/openai/CLIP)
einops==0.3.2
timm==0.4.12
albumentations==1.1.0
opencv-python==4.5.5.64

Data Preparation

As in LSeg, we follow HSNet for data preparation. The datasets should be placed in the following directory structure:

For PASCAL-$5^i$ dataset:

dataset_root
    ├── SegmentationClassAug
    └── VOCdevkit
        └── VOC2012
            ├── Annotations
            ├── ImageSets
            ├── JPEGImages
            ├── SegmentationClass
            └── SegmentationObject

For COCO-$20^i$ dataset:

dataset_root
    ├── annotation
    ├── train2014
    └── val2014

For more details, such as how to download the datasets, please refer to the HSNet dataset preparation instructions.
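Before training, it can help to confirm that dataset_root matches the trees above. The following is a minimal sketch, not part of this repo: the names EXPECTED_ENTRIES and missing_entries are hypothetical, and the entry lists are copied directly from the directory structures shown in this section.

```python
import os

# Expected entries under dataset_root, copied from the directory trees above.
EXPECTED_ENTRIES = {
    "pascal": [
        "SegmentationClassAug",
        os.path.join("VOCdevkit", "VOC2012", "Annotations"),
        os.path.join("VOCdevkit", "VOC2012", "ImageSets"),
        os.path.join("VOCdevkit", "VOC2012", "JPEGImages"),
        os.path.join("VOCdevkit", "VOC2012", "SegmentationClass"),
        os.path.join("VOCdevkit", "VOC2012", "SegmentationObject"),
    ],
    "coco": ["annotation", "train2014", "val2014"],
}


def missing_entries(dataset_name, dataset_root):
    """Return the expected entries that do not exist under dataset_root."""
    return [
        rel
        for rel in EXPECTED_ENTRIES[dataset_name]
        if not os.path.exists(os.path.join(dataset_root, rel))
    ]
```

Calling missing_entries("coco", "your/pascal/or/coco/dataset_root") returns an empty list when the layout is complete, and otherwise the relative paths that still need to be created or downloaded.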

Training

python train.py --dataset_name {pascal, coco} \
                --dataset_root your/pascal/or/coco/dataset_root \
                --fold {0, 1, 2, 3}
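# The braces above list the allowed values; each run is assumed to take one
# dataset name and one fold. To train all four folds in sequence, a small
# Python driver along these lines may be convenient. This is a sketch only:
# build_train_command and train_all_folds are hypothetical helpers that are
# not part of this repo, and they use only the flags shown above.

```python
import subprocess
import sys


def build_train_command(dataset_name, dataset_root, fold):
    """Build the train.py command line for one dataset and one fold."""
    return [
        sys.executable, "train.py",
        "--dataset_name", dataset_name,
        "--dataset_root", dataset_root,
        "--fold", str(fold),
    ]


def train_all_folds(dataset_name, dataset_root, folds=(0, 1, 2, 3)):
    """Run train.py once per fold, stopping at the first failing run."""
    for fold in folds:
        subprocess.run(
            build_train_command(dataset_name, dataset_root, fold),
            check=True,  # raise CalledProcessError if a run exits non-zero
        )
```

# For example, train_all_folds("pascal", "your/pascal/or/coco/dataset_root")
# launches four consecutive runs from the repo root.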

Evaluation

python test.py --dataset_name {pascal, coco} \
               --dataset_root your/pascal/or/coco/dataset_root \
               --fold {0, 1, 2, 3} \
               --test_with_org_resolution \
               --load_ckpt_path path/to/saved/checkpoint

The training code does not currently save model checkpoints. Add checkpoint saving yourself, then pass the saved file to test.py via --load_ckpt_path for evaluation.
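One possible way to add checkpoint saving is sketched below. It is an assumption, not the authors' code: save_checkpoint is a hypothetical helper, the file naming is arbitrary, and it assumes that test.py can load a plain state_dict. Check how test.py reads --load_ckpt_path and adapt the saved format if it expects something else.

```python
import os

import torch


def save_checkpoint(model, ckpt_dir, fold, epoch):
    """Save the model weights and return the path of the written file."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"fold{fold}_epoch{epoch}.pth")
    # Assumption: the evaluation script expects a plain state_dict.
    torch.save(model.state_dict(), path)
    return path
```

Calling save_checkpoint(model, "checkpoints", fold, epoch) at the end of each epoch in the training loop yields a file whose path can then be passed to --load_ckpt_path.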

Citation

If this code is useful for your research, please consider citing:

@inproceedings{ma2022fusioner,
  title     = {Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models},
  author    = {Chaofan Ma and Yuhuan Yang and Yanfeng Wang and Ya Zhang and Weidi Xie},
  booktitle = {British Machine Vision Conference},
  year      = {2022}
}

Acknowledgements

Many thanks to the code bases of LSeg, CLIP, Segmenter, and HSNet.