
Language-Driven Semantic Segmentation


PROJECT NOT UNDER ACTIVE MANAGEMENT

This project will no longer be maintained by Intel.
Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.
Intel no longer accepts patches to this project.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

Language-driven Semantic Segmentation (LSeg)

This repository contains the official PyTorch implementation of the paper Language-driven Semantic Segmentation.

ICLR 2022

Authors: Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, Rene Ranftl

Overview

We present LSeg, a novel model for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., "grass" or "building") together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings to the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space (e.g., "cat" and "furry"). This allows LSeg to generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample. We demonstrate that our approach achieves highly competitive zero-shot performance compared to existing zero- and few-shot semantic segmentation methods, and even matches the accuracy of traditional segmentation algorithms when a fixed label set is provided.
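
To make the mechanism above concrete, here is a minimal, dependency-free sketch of matching per-pixel embeddings against label embeddings. It is not the repository's implementation: the function names (`cosine`, `segment`, `pixel_cross_entropy`), the toy 2-D embeddings, and the `temperature` value are assumptions for illustration, and the softmax cross-entropy shown is only one common way to realize an alignment objective of this kind. The real model computes these quantities with tensor operations on encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def segment(pixel_embeddings, text_embeddings):
    """Assign each pixel the index of the most similar label embedding.

    pixel_embeddings: H x W grid (nested lists) of embedding vectors.
    text_embeddings:  list of N label embedding vectors.
    """
    return [
        [max(range(len(text_embeddings)),
             key=lambda k: cosine(pixel, text_embeddings[k]))
         for pixel in row]
        for row in pixel_embeddings
    ]

def pixel_cross_entropy(pixel, text_embeddings, target, temperature=0.07):
    """Softmax cross-entropy over pixel-to-label similarities for one pixel.

    The temperature value is an illustrative assumption.
    """
    logits = [cosine(pixel, t) / temperature for t in text_embeddings]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]

# Toy example: two labels and a 2x2 "image" with hand-made 2-D embeddings.
labels = ["grass", "building"]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
pixel_embeddings = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
label_map = segment(pixel_embeddings, text_embeddings)
named = [[labels[k] for k in row] for row in label_map]
print(named)
```

Because the label set enters only as a list of text embeddings, a different `labels` list can be supplied at test time without touching the image side, which is the property the overview describes.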

Please see our Video Demo (4k), which further showcases the capabilities of LSeg.

Usage

Installation

Option 1:

pip install -r requirements.txt

Option 2:

conda install ipython
pip install torch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2
pip install git+https://github.com/zhanghang1989/PyTorch-Encoding/
pip install pytorch-lightning==1.3.5
pip install opencv-python
pip install imageio
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
pip install altair
pip install streamlit
pip install --upgrade protobuf
pip install timm
pip install tensorboardX
pip install matplotlib
pip install test-tube
pip install wandb
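
Since several of the packages above are pinned to exact versions, a quick check of the active environment can catch mismatches early. The following is a minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`); the helper names `installed_version` and `check_pins` are hypothetical and not part of this repository.

```python
from importlib import metadata

# Pins copied from the Option 2 install commands.
PINNED = {
    "torch": "1.7.1",
    "torchvision": "0.8.2",
    "torchaudio": "0.7.2",
    "pytorch-lightning": "1.3.5",
}

def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

def check_pins(pins, lookup=installed_version):
    """Return {name: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        # Ignore local build tags such as "+cu110" when comparing.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches

if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package is present at the expected version.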

Data Preparation

By default, for training, testing and demo, we use ADE20k.

python prepare_ade20k.py
unzip ../datasets/ADEChallengeData2016.zip
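
After extracting the archive, it can be useful to confirm that every image has a matching annotation mask. The sketch below assumes the commonly distributed ADE20K layout (`images/<split>/*.jpg` and `annotations/<split>/*.png` under the extracted folder); that layout, the `dataset_root` path, and the helper name `unpaired_images` are assumptions, not details taken from this README.

```python
from pathlib import Path

def unpaired_images(root, split="training"):
    """Return stems of images in `split` that lack an annotation mask.

    Assumes <root>/images/<split>/*.jpg and <root>/annotations/<split>/*.png
    (an assumed layout; adjust if the extracted archive differs).
    """
    root = Path(root)
    images = {p.stem for p in (root / "images" / split).glob("*.jpg")}
    masks = {p.stem for p in (root / "annotations" / split).glob("*.png")}
    return sorted(images - masks)

if __name__ == "__main__":
    # Hypothetical location; point this at wherever the archive was extracted.
    dataset_root = Path("ADEChallengeData2016")
    if dataset_root.is_dir():
        for split in ("training", "validation"):
            missing = unpaired_images(dataset_root, split)
            print(f"{split}: {len(missing)} images without a mask")
```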

Note: for the demo, if you want to use random inputs, you can skip data loading and comment out the code at link.

🌻 Try demo now

Download Demo Model

| name | backbone | text encoder | url |
| --- | --- | --- | --- |
| Model for demo | ViT-L/16 | CLIP ViT-B/32 | download |

👉 Option 1: Running interactive app

Download the model for demo and put it under folder checkpoints as checkpoints/demo_e200.ckpt.

Then run: streamlit run lseg_app.py

👉 Option 2: Jupyter Notebook

Download the model for demo and put it under folder checkpoints as checkpoints/demo_e200.ckpt.

Then follow lseg_demo.ipynb to play around with LSeg. Enjoy!

Training and Testing Example

Training: Backbone = ViT-L/16, Text Encoder from CLIP ViT-B/32

bash train.sh

Testing: Backbone = ViT-L/16, Text Encoder from CLIP ViT-B/32

bash test.sh

Zero-shot Experiments

Data Preparation

Please follow HSNet and put all datasets under data/Dataset_HSN.

Pascal-5i

for fold in 0 1 2 3; do
python -u test_lseg_zs.py --backbone clip_resnet101 --module clipseg_DPT_test_v2 --dataset pascal \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold ${fold} --nshot 0 \
--weights checkpoints/pascal_fold${fold}.ckpt 
done

COCO-20i

for fold in 0 1 2 3; do
# Assumes the COCO checkpoints from the Model Zoo are saved as checkpoints/coco_fold${fold}.ckpt
python -u test_lseg_zs.py --backbone clip_resnet101 --module clipseg_DPT_test_v2 --dataset coco \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold ${fold} --nshot 0 \
--weights checkpoints/coco_fold${fold}.ckpt
done

FSS

python -u test_lseg_zs.py --backbone clip_vitl16_384 --module clipseg_DPT_test_v2 --dataset fss \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold 0 --nshot 0 \
--weights checkpoints/fss_l16.ckpt 
python -u test_lseg_zs.py --backbone clip_resnet101 --module clipseg_DPT_test_v2 --dataset fss \
--widehead --no-scaleinv --arch_option 0 --ignore_index 255 --fold 0 --nshot 0 \
--weights checkpoints/fss_rn101.ckpt 
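
The zero-shot commands share every flag except the dataset, backbone, fold, and weights, so they can also be generated from a small driver. This is a sketch only: the helper name `zero_shot_command` is hypothetical, the flags are copied from the commands above, and it assumes Python 3.8 or newer for `shlex.join`. It prints the commands rather than running them.

```python
import shlex

def zero_shot_command(dataset, backbone, weights, fold=0):
    """Build the argument list for one zero-shot evaluation run."""
    return [
        "python", "-u", "test_lseg_zs.py",
        "--backbone", backbone,
        "--module", "clipseg_DPT_test_v2",
        "--dataset", dataset,
        "--widehead", "--no-scaleinv",
        "--arch_option", "0",
        "--ignore_index", "255",
        "--fold", str(fold),
        "--nshot", "0",
        "--weights", weights,
    ]

# The four Pascal-5i folds, followed by the two FSS runs.
runs = [
    zero_shot_command("pascal", "clip_resnet101",
                      f"checkpoints/pascal_fold{fold}.ckpt", fold)
    for fold in range(4)
]
runs.append(zero_shot_command("fss", "clip_vitl16_384", "checkpoints/fss_l16.ckpt"))
runs.append(zero_shot_command("fss", "clip_resnet101", "checkpoints/fss_rn101.ckpt"))

for argv in runs:
    # To execute instead of print: subprocess.run(argv, check=True)
    print(shlex.join(argv))
```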

Model Zoo

| dataset | fold | backbone | text encoder | performance (mIoU) | url |
| --- | --- | --- | --- | --- | --- |
| pascal | 0 | ResNet101 | CLIP ViT-B/32 | 52.8 | download |
| pascal | 1 | ResNet101 | CLIP ViT-B/32 | 53.8 | download |
| pascal | 2 | ResNet101 | CLIP ViT-B/32 | 44.4 | download |
| pascal | 3 | ResNet101 | CLIP ViT-B/32 | 38.5 | download |
| coco | 0 | ResNet101 | CLIP ViT-B/32 | 22.1 | download |
| coco | 1 | ResNet101 | CLIP ViT-B/32 | 25.1 | download |
| coco | 2 | ResNet101 | CLIP ViT-B/32 | 24.9 | download |
| coco | 3 | ResNet101 | CLIP ViT-B/32 | 21.5 | download |
| fss | - | ResNet101 | CLIP ViT-B/32 | 84.7 | download |
| fss | - | ViT-L/16 | CLIP ViT-B/32 | 87.8 | download |
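
The pascal and coco entries are reported per fold. For a single summary number per dataset, the four fold values can be averaged, as in this small sketch. The values are copied from the table above; the averaging is a convenience for the reader and not an additional reported result.

```python
from statistics import mean

# Per-fold performance values copied from the Model Zoo table (ResNet101 backbone).
scores = {
    "pascal": [52.8, 53.8, 44.4, 38.5],
    "coco": [22.1, 25.1, 24.9, 21.5],
}

fold_means = {name: round(mean(values), 1) for name, values in scores.items()}
print(fold_means)  # {'pascal': 47.4, 'coco': 23.4}
```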

If you find this repo useful, please cite:

@inproceedings{
li2022languagedriven,
title={Language-driven Semantic Segmentation},
author={Boyi Li and Kilian Q Weinberger and Serge Belongie and Vladlen Koltun and Rene Ranftl},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=RriDjddCLN}
}

Acknowledgement

Thanks to the code bases from DPT, Pytorch_lightning, CLIP, Pytorch Encoding, Streamlit, and Wandb.