Decoupling Zero-Shot Semantic Segmentation

This is the official code for the ZegFormer (CVPR 2022).

ZegFormer is the first framework that decouple the zero-shot semantic segmentation into: 1) class-agnostic segmentation and 2) segment-level zero-shot classification

Visualization of semantic segmentation with open vocabularies

ZegFormer is able to segment stuff and things with open vocabularies. The predicted classes can be more fine-grained than the COCO-Stuff annotations (see colored boxes below).

Data Preparation

See data preparation

Config files

For each model, there are two kinds of config files. The file without suffix "_gzss_eval" is used for training. The file with suffix "_gzss_eval" is used for generalized zero-shot semantic segmentation evaluation.

Inference Demo with Pre-trained Models

Download the checkpoints of ZegFormer from https://drive.google.com/drive/u/0/folders/1qcIe2mE1VRU1apihsao4XvANJgU5lYgm

python demo/demo.py --config-file configs/coco-stuff/zegformer_R101_bs32_60k_vit16_coco-stuff_gzss_eval.yaml \
  --input input1.jpg input2.jpg \
  [--other-options]
  --opts MODEL.WEIGHTS /path/to/zegformer_R101_bs32_60k_vit16_coco-stuff.pth

The configs are made for training, therefore we need to specify MODEL.WEIGHTS to a model from model zoo for evaluation. This command will run the inference and show visualizations in an OpenCV window.

For details of the command line arguments, see demo.py -h or look at its source code to understand its behavior. Some common arguments are:

To run on your webcam, replace --input files with --webcam.
To run on a video, replace --input files with --video-input video.mp4.
To run on cpu, add MODEL.DEVICE cpu after --opts.
To save outputs to a directory (for images) or a file (for webcam or video), use --output.

Inference with more classnames

In the example above, the model is trained with 156 classes, and inferenced with 171 classes.

If you want to inference with more classes, try the config zegformer_R101_bs32_60k_vit16_coco-stuff_gzss_eval_847_classes.yaml.

Training & Evaluation in Command Line

To train models with R-101 backbone, download the pre-trained model R-101.pkl, which is a converted copy of MSRA's original ResNet-101 model.

We provide two scripts in train_net.py, that are made to train all the configs provided in MaskFormer.

To train a model with "train_net.py", first setup the corresponding datasets following datasets/README.md, then run:

./train_net.py --num-gpus 8 \
  --config-file configs/coco-stuff/zegformer_R101_bs32_60k_vit16_coco-stuff.yaml

The configs are made for 8-GPU training. Since we use ADAMW optimizer, it is not clear how to scale learning rate with batch size. To train on 1 GPU, you need to figure out learning rate and batch size by yourself:

./train_net.py \
  --config-file configs/coco-stuff/zegformer_R101_bs32_60k_vit16_coco-stuff.yaml \
  --num-gpus 1 SOLVER.IMS_PER_BATCH SET_TO_SOME_REASONABLE_VALUE SOLVER.BASE_LR SET_TO_SOME_REASONABLE_VALUE

To evaluate a model's performance, use

./train_net.py \
  --config-file configs/coco-stuff/zegformer_R101_bs32_60k_vit16_coco-stuff_gzss_eval.yaml \
  --eval-only MODEL.WEIGHTS /path/to/checkpoint_file

For more options, see ./train_net.py -h.

The pre-trained checkpoints of ZegFormer can be downloaded from https://drive.google.com/drive/folders/1qcIe2mE1VRU1apihsao4XvANJgU5lYgm?usp=sharing

Disclaimer

Although the reported results on PASCAL VOC are trained with 10k iterations, the results at 10k are not stable. We recommend to train models with longer iterations.

Acknowlegment

This repo benefits from CLIP and MaskFormer. Thanks for their wonderful works.

Citation

@article{ding2021decoupling,
  title={Decoupling Zero-Shot Semantic Segmentation},
  author={Ding, Jian and Xue, Nan and Xia, Gui-Song and Dai, Dengxin},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

If you have any problems in using this code, please contact me (jian.ding@whu.edu.cn)

dingjiansw101/ZegFormer