This is the PyTorch implementation of the technical part of CAusal Unsupervised Semantic sEgmentation (CAUSE), which improves the performance of unsupervised semantic segmentation. The code builds on two baseline codebases: HP (Leveraging Hidden Positives for Unsupervised Semantic Segmentation, CVPR 2023) and STEGO (Unsupervised Semantic Segmentation by Distilling Feature Correspondences, ICLR 2022).
You can see the following bundle of images in the Appendix. We also explain concrete implementation details beyond the description in the main paper.

Figure 1. Visual comparison of USS on COCO-Stuff. Note that, in contrast to the true labels, the baseline frameworks fail to achieve the targeted level of granularity, while CAUSE successfully clusters person, sports, vehicle, etc.

Figure 2. Qualitative comparison of unsupervised semantic segmentation on Cityscapes.

Figure 3. Log-scale mIoU results for each category in COCO-Stuff (Black: Thing / Gray: Stuff).

You can download checkpoint files containing CAUSE-trained parameters based on the DINO, DINOv2, iBOT, MSN, and MAE self-supervised vision transformer frameworks. If you want to download the pretrained DINO models in the various structures that CAUSE uses, they are available at the following links:
Dataset | Method | Baseline | mIoU(%) | pAcc(%) | Visual Quality | Seg Head Parameter | Concept ClusterBook |
---|---|---|---|---|---|---|---|
COCO-Stuff | DINO+CAUSE-MLP | ViT-S/8 | 27.9 | 66.8 | [link] | [link] | [link] |
COCO-Stuff | DINO+CAUSE-TR | ViT-S/8 | 32.4 | 69.6 | [link] | [link] | [link] |
COCO-Stuff | DINO+CAUSE-MLP | ViT-S/16 | 25.9 | 66.3 | [link] | [link] | [link] |
COCO-Stuff | DINO+CAUSE-TR | ViT-S/16 | 33.1 | 70.4 | [link] | [link] | [link] |
COCO-Stuff | DINO+CAUSE-MLP | ViT-B/8 | 34.3 | 72.8 | [link] | [link] | [link] |
COCO-Stuff | DINO+CAUSE-TR | ViT-B/8 | 41.9 | 74.9 | [link] | [link] | [link] |
COCO-Stuff | DINOv2+CAUSE-TR | ViT-B/14 | 45.3 | 78.0 | [link] | [link] | [link] |
COCO-Stuff | iBOT+CAUSE-TR | ViT-B/16 | 39.5 | 73.8 | [link] | [link] | [link] |
COCO-Stuff | MSN+CAUSE-TR | ViT-S/16 | 34.1 | 72.1 | [link] | [link] | [link] |
COCO-Stuff | MAE+CAUSE-TR | ViT-B/16 | 21.5 | 59.1 | [link] | [link] | [link] |
Dataset | Method | Baseline | mIoU(%) | pAcc(%) | Visual Quality | Seg Head Parameter | Concept ClusterBook |
---|---|---|---|---|---|---|---|
Cityscapes | DINO+CAUSE-MLP | ViT-S/8 | 21.7 | 87.7 | [link] | [link] | [link] |
Cityscapes | DINO+CAUSE-TR | ViT-S/8 | 24.6 | 89.4 | [link] | [link] | [link] |
Cityscapes | DINO+CAUSE-MLP | ViT-B/8 | 25.7 | 90.3 | [link] | [link] | [link] |
Cityscapes | DINO+CAUSE-TR | ViT-B/8 | 28.0 | 90.8 | [link] | [link] | [link] |
Cityscapes | DINOv2+CAUSE-TR | ViT-B/14 | 29.9 | 89.8 | [link] | [link] | [link] |
Cityscapes | iBOT+CAUSE-TR | ViT-B/16 | 23.0 | 89.1 | [link] | [link] | [link] |
Cityscapes | MSN+CAUSE-TR | ViT-S/16 | 21.2 | 89.1 | [link] | [link] | [link] |
Cityscapes | MAE+CAUSE-TR | ViT-B/16 | 12.5 | 82.0 | [link] | [link] | [link] |
Dataset | Method | Baseline | mIoU(%) | pAcc(%) | Visual Quality | Seg Head Parameter | Concept ClusterBook |
---|---|---|---|---|---|---|---|
Pascal VOC | DINO+CAUSE-MLP | ViT-S/8 | 46.0 | - | [link] | [link] | [link] |
Pascal VOC | DINO+CAUSE-TR | ViT-S/8 | 50.0 | - | [link] | [link] | [link] |
Pascal VOC | DINO+CAUSE-MLP | ViT-B/8 | 47.9 | - | [link] | [link] | [link] |
Pascal VOC | DINO+CAUSE-TR | ViT-B/8 | 53.3 | - | [link] | [link] | [link] |
Pascal VOC | DINOv2+CAUSE-TR | ViT-B/14 | 53.2 | 91.5 | [link] | [link] | [link] |
Pascal VOC | iBOT+CAUSE-TR | ViT-B/16 | 53.4 | 89.6 | [link] | [link] | [link] |
Pascal VOC | MSN+CAUSE-TR | ViT-S/16 | 30.2 | 84.2 | [link] | [link] | [link] |
Pascal VOC | MAE+CAUSE-TR | ViT-B/16 | 25.8 | 83.7 | [link] | [link] | [link] |
Dataset | Method | Baseline | mIoU(%) | pAcc(%) | Visual Quality | Seg Head Parameter | Concept ClusterBook |
---|---|---|---|---|---|---|---|
COCO-81 | DINO+CAUSE-MLP | ViT-S/8 | 19.1 | 78.8 | [link] | [link] | [link] |
COCO-81 | DINO+CAUSE-TR | ViT-S/8 | 21.2 | 75.2 | [link] | [link] | [link] |
COCO-171 | DINO+CAUSE-MLP | ViT-S/8 | 10.6 | 44.9 | [link] | [link] | [link] |
COCO-171 | DINO+CAUSE-TR | ViT-S/8 | 15.2 | 46.6 | [link] | [link] | [link] |
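The Seg Head Parameter and Concept ClusterBook files above are ordinary PyTorch checkpoints. As a minimal sketch for inspecting one after download (the file name here is a placeholder, and the key layout may differ per release):

```python
import torch

# Placeholder file name: substitute the checkpoint you actually downloaded.
ckpt = torch.load("cause_tr_cocostuff27_vit_base_8.pth", map_location="cpu")

# Releases may store a raw state_dict or a dict wrapping one,
# so inspect the top-level keys before loading into a model.
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```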
.
├── loader
│ ├── netloader.py # Self-Supervised Pretrained Model Loader & Segmentation Head Loader
│ └── dataloader.py # Dataloader Thanks to STEGO [ICLR 2022]
│
├── models # Model Design of Self-Supervised Pretrained: [DINO/DINOv2/iBOT/MAE/MSN]
│ ├── dinomaevit.py # ViT Structure of DINO and MAE
│ ├── dinov2vit.py # ViT Structure of DINOv2
│ ├── ibotvit.py # ViT Structure of iBOT
│ └── msnvit.py # ViT Structure of MSN
│
├── modules # Segmentation Head and Its Necessary Function
│ ├── segment_module.py # Tools for Generating the Concept Cluster Book and Contrastive Learning
│ └── segment.py # [MLP & TR] Segmentation Heads Using the Tools Above
│
├── utils
│ └── utils.py # Utility for auxiliary tools
│
├── train_mediator.py # (STEP 1) [MLP & TR] Generating the Concept Cluster Book as a Mediator
│
├── train_front_door_mlp.py # (STEP 2) [MLP] Frontdoor Adjustment through Unsupervised Semantic Segmentation
├── fine_tuning_mlp.py # (STEP 3) [MLP] Fine-Tuning Cluster Probe
│
├── train_front_door_tr.py # (STEP 2) [TR] Frontdoor Adjustment through Unsupervised Semantic Segmentation
├── fine_tuning_tr.py # (STEP 3) [TR] Fine-Tuning Cluster Probe
│
├── test_mlp.py # [MLP] Evaluating Unsupervised Semantic Segmentation Performance (Post-Processing)
├── test_tr.py # [TR] Evaluating Unsupervised Semantic Segmentation Performance (Post-Processing)
│
├── requirements.txt
└── README.md
First, generate the cropped datasets by following STEGO (ICLR 2022):
python crop_dataset.py --dataset "cocostuff27" --crop_type "five"
python crop_dataset.py --dataset "cityscapes" --crop_type "five"
python crop_dataset.py --dataset "pascalvoc" --crop_type "super"
python crop_dataset.py --dataset "cooc81" --crop_type "double"
python crop_dataset.py --dataset "cooc171" --crop_type "double"
Then run:
bash run # All of the following three steps integrated
The shell script contains the following code:
#!/bin/bash
######################################
# [OPTION] DATASET
# cocostuff27
dataset="cocostuff27"
#############
######################################
# [OPTION] STRUCTURE
structure="TR"
######################################
######################################
# [OPTION] Self-Supervised Method
ckpt="checkpoint/dino_vit_base_8.pth"
######################################
######################################
# GPU and PORT
if [ "$structure" = "MLP" ]
then
train_gpu="0,1,2,3"
elif [ "$structure" = "TR" ]
then
train_gpu="4,5,6,7"
fi
# Derived variables (do not change)
test_gpu="${train_gpu:0:1}" # test on the first training GPU only
port=$(($RANDOM%800+1200)) # random port in [1200, 1999] for distributed training
######################################
######################################
# [STEP1] MEDIATOR
python train_mediator.py --dataset $dataset --ckpt $ckpt --gpu $train_gpu --port $port
######################################
######################################
# [STEP2] CAUSE
if [ "$structure" = "MLP" ]
then
python train_front_door_mlp.py --dataset $dataset --ckpt $ckpt --gpu $train_gpu --port $port
python fine_tuning_mlp.py --dataset $dataset --ckpt $ckpt --gpu $train_gpu --port $port
elif [ "$structure" = "TR" ]
then
python train_front_door_tr.py --dataset $dataset --ckpt $ckpt --gpu $train_gpu --port $port
python fine_tuning_tr.py --dataset $dataset --ckpt $ckpt --gpu $train_gpu --port $port
fi
######################################
######################################
# TEST
if [ "$structure" = "MLP" ]
then
python test_mlp.py --dataset $dataset --ckpt $ckpt --gpu $test_gpu
elif [ "$structure" = "TR" ]
then
python test_tr.py --dataset $dataset --ckpt $ckpt --gpu $test_gpu
fi
######################################
Alternatively, you can run each step manually:
# (STEP 1) MEDIATOR # DINO/DINOv2/iBOT/MSN/MAE
python train_mediator.py
# (STEP 2) Frontdoor adjustment
python train_front_door_mlp.py # CAUSE-MLP
# or
python train_front_door_tr.py # CAUSE-TR
# (STEP 3) Fine-tuning the cluster probe
python fine_tuning_mlp.py # CAUSE-MLP
# or
python fine_tuning_tr.py # CAUSE-TR
# TEST
python test_mlp.py # CAUSE-MLP
# or
python test_tr.py # CAUSE-TR
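Conceptually, STEP 1 distills patch embeddings from the frozen self-supervised backbone into a discrete concept cluster book, which then serves as the mediator for the frontdoor adjustment in STEP 2. The sketch below illustrates only the nearest-concept assignment idea; the sizes, names, and update rule are illustrative assumptions, not the repository's exact implementation:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes: 2048 concept prototypes over 384-dim ViT-S patch features.
num_concepts, dim = 2048, 384
codebook = F.normalize(torch.randn(num_concepts, dim), dim=1)  # concept cluster book

def assign_concepts(patch_feats: torch.Tensor) -> torch.Tensor:
    """Map [N, dim] patch features to nearest concept indices by cosine similarity."""
    feats = F.normalize(patch_feats, dim=1)
    return (feats @ codebook.t()).argmax(dim=1)  # one discrete mediator per patch

# Example: 196 patch tokens from one image (14x14 grid for ViT-S/16 at 224px input).
print(assign_concepts(torch.randn(196, dim)).shape)  # torch.Size([196])
```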
- Create a virtual environment with Anaconda
conda create -y -n neurips python=3.9
conda activate neurips
- Install the PyTorch packages in the virtual environment
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
- Install the remaining pip packages
pip install -r requirements.txt
- [Optional] Remove the conda and pip caches if conda or pip is locked for unknown reasons
conda clean -a && pip cache purge
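After installation, a quick sanity check (run inside the activated environment) confirms that PyTorch sees your GPUs:

```python
import torch
print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())
```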
Note: You do not need to download Pascal VOC manually; the dataloader will download it automatically into your dataset path.
If the direct downloads do not work, download azcopy and run the scripts below:
- azcopy copy "https://marhamilresearch4.blob.core.windows.net/stego-public/pytorch_data/cityscapes.zip" "custom_path" --recursive
- azcopy copy "https://marhamilresearch4.blob.core.windows.net/stego-public/pytorch_data/cocostuff.zip" "custom_path" --recursive
- unzip cocostuff.zip && unzip cityscapes.zip