Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision

This repository is the official implementation of Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision at CVPR 2023. Our transformer-based model, termed OVSegmentor, is pre-trained on image-text pairs without using any mask annotations. After training, it can segment objects of arbitrary categories via zero-shot transfer.

Prepare environment

Please refer to docs/ENV_README.md

Prepare datasets

Please refer to docs/DATA_README.md

Prepare model

Please refer to docs/MODEL_README.md

Demo

To use the model directly for open-vocabulary segmentation:

(1) Download the pretrained model.

(2) Prepare your images and place them in a single folder, for example:

- OVSegmentor
  - visualization
    - input
      - 1.jpg
      - 2.jpg
      ...
    - output
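
The layout above can be created with a few lines of Python. This is only a convenience sketch; any folder works as long as --image_folder and --output_folder point to it, and the image paths below are placeholders:

import os
import shutil

# Convenience sketch: copy your test images into the demo input folder and
# create an empty output folder for the predictions. Paths are placeholders.
input_dir = "./visualization/input"
output_dir = "./visualization/output"
os.makedirs(input_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)

for src in ["/path/to/1.jpg", "/path/to/2.jpg"]:  # replace with your own images
    shutil.copy(src, input_dir)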

(3) Option 1 --- Perform segmentation using the vocabulary of an existing dataset (e.g. voc/coco/ade) by running:

python -u -m main_demo \
    --cfg configs/test_voc12.yml \
    --resume /path/to/the/pretrained_checkpoint.pth \
    --vis input_pred_label \
    --vocab voc \
    --image_folder ./visualization/input/ \
    --output_folder ./visualization/output/

(3) Option 2 --- Perform segmentation using custom classes (list your own open-vocabulary classes after --vocab) by running:

python -u -m main_demo \
    --cfg configs/test_voc12.yml \
    --resume /path/to/the/pretrained_checkpoint.pth \
    --vis input_pred_label \
    --vocab cat dog train bus \
    --image_folder ./visualization/input/ \
    --output_folder ./visualization/output/
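
For reference, the sketch below illustrates the generic recipe by which open-vocabulary segmentors turn class names into text embeddings (a prompt template plus a text encoder). It is an illustration only, not the repository's API; the prompt template and the bert-base-uncased encoder are assumptions made for the example:

import torch
from transformers import BertTokenizer, BertModel

# Illustration only: embed custom class names with a BERT text encoder.
# Region-text similarity is then computed against these class embeddings.
classes = ["cat", "dog", "train", "bus"]
prompts = [f"a photo of a {c}" for c in classes]  # assumed prompt template

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased").eval()

with torch.no_grad():
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    out = text_encoder(**tokens)
    # Use the [CLS] embedding as the class text feature, L2-normalised.
    text_feats = torch.nn.functional.normalize(out.last_hidden_state[:, 0], dim=-1)

print(text_feats.shape)  # torch.Size([4, 768])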

You can also save the segmentation masks (e.g. for evaluation) by running:

python -u -m main_demo \
    --cfg configs/test_voc12.yml \
    --resume /path/to/the/pretrained_checkpoint.pth \
    --vis mask \
    --vocab voc \
    --image_folder ./visualization/input/ \
    --output_folder ./visualization/output/
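
Saved masks can then be scored offline against ground-truth label maps. The snippet below is a minimal sketch that assumes both predictions and ground truth are stored as single-channel PNG label maps with matching file names; the actual output format of the demo may differ:

import numpy as np
from PIL import Image

def per_class_iou(pred, gt, num_classes, ignore_index=255):
    # Accumulate a confusion matrix over valid pixels and derive per-class IoU.
    valid = gt != ignore_index
    hist = np.bincount(
        num_classes * gt[valid].astype(int) + pred[valid].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    inter = np.diag(hist)
    union = hist.sum(0) + hist.sum(1) - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        return inter / union

pred = np.array(Image.open("./visualization/output/1.png"))  # predicted label map
gt = np.array(Image.open("/path/to/ground_truth/1.png"))     # ground-truth label map
iou = per_class_iou(pred, gt, num_classes=21)                # 21 classes for PASCAL VOC
print("mIoU:", np.nanmean(iou))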

Training

To train the model(s) in the paper, we separate the training process into a two-stage pipeline. The first stage trains for 30 epochs with the image-caption contrastive loss and the masked entity completion loss; the second stage trains for a further 10 epochs and adds the cross-image mask consistency loss.
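
In pseudocode, the objective looks as follows. This is a schematic sketch only: contrastive_loss, entity_completion_loss and mask_consistency_loss stand in for the three losses described in the paper, and the loss weights are hypothetical:

# Schematic sketch of the two-stage objective; the three loss functions below are
# placeholders for the losses in the paper, and the weights are hypothetical.
def total_loss(batch, model, stage, w_mec=1.0, w_cons=1.0):
    loss = contrastive_loss(batch, model)                         # stages 1 and 2
    loss = loss + w_mec * entity_completion_loss(batch, model)    # stages 1 and 2
    if stage == 2:
        loss = loss + w_cons * mask_consistency_loss(batch, model)  # stage 2 only
    return loss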

For the first-stage training on a single node with 8 A100 (80G) GPUs, we recommend using the SLURM script:

./scripts/run_slurm.sh

Or simply use torch.distributed.launch as:

./scripts/run.sh

Stage-2 training: specify the stage-1 checkpoint in the stage-2 config file. We also provide our pre-trained stage-1 checkpoint here.

stage1_checkpoint: /path/to/your/stage1_best_miou.pth

Then, perform the second stage training.

./scripts/run_slurm_stage2.sh

Evaluation

To evaluate the model on PASCAL VOC, please specify the resume checkpoint path in scripts/test_voc12.sh, and run:

./scripts/test_voc12.sh

For PASCAL Context, COCO Object, and ADE20K, please refer to ./scripts/.

The performance may vary by 3%~4% due to different cross-image sampling.

Model Zoo

The pre-trained models can be downloaded from here:

| Model name | Visual enc | Text enc | Group tokens | PASCAL VOC (mIoU) | PASCAL Context (mIoU) | COCO Object (mIoU) | ADE20K (mIoU) | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| OVSegmentor | ViT-B | BERT-Base | 8 | 53.8 | 20.4 | 25.1 | 5.6 | download |
| OVSegmentor | ViT-S | RoBERTa-Base | 8 | 44.5 | 18.3 | 19.0 | 4.3 | download |
| OVSegmentor | ViT-B | BERT-Base | 16 | Todo | Todo | Todo | Todo | Todo |
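
After downloading, a quick sanity check of a checkpoint can be done as below (a sketch; the exact dictionary keys stored in the checkpoint are assumptions and may differ):

import torch

# Load the downloaded checkpoint on CPU and inspect its contents before use.
ckpt = torch.load("/path/to/the/pretrained_checkpoint.pth", map_location="cpu")
print(list(ckpt.keys()))  # e.g. ['model', 'epoch', ...] -- exact keys may differ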

Citation

If this work is helpful for your research, please consider citing us.

@inproceedings{xu2023learning,
  title={Learning open-vocabulary semantic segmentation models from natural language supervision},
  author={Xu, Jilan and Hou, Junlin and Zhang, Yuejie and Feng, Rui and Wang, Yi and Qiao, Yu and Xie, Weidi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2935--2944},
  year={2023}
}

Acknowledgements

This project is built upon GroupViT. Thanks to the contributors for their great codebase.