This is an official release of the paper CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction.
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction,
Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy
BibTeX
- Code and models of CLIPSelf
- Code and models of F-ViT
- Support F-ViT under the ovdet repo using MMDetection 3.x
This project is adapted from OpenCLIP-v2.16.0. Run the following command to install the package:
pip install -e . -v
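After the editable install, a quick way to confirm the package is importable is the minimal sketch below. It assumes this fork keeps upstream OpenCLIP's `open_clip` module name and its `__version__`/`list_pretrained` helpers; if the fork renames anything, adjust accordingly.

```python
# Minimal sanity check of the editable install (assumes the upstream
# OpenCLIP package layout is preserved in this fork).
import open_clip

print(open_clip.__version__)  # expected to be around 2.16.0
print(len(open_clip.list_pretrained()), "registered (model, pretrained-tag) pairs")
```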
The main experiments are conducted on images from the COCO and LVIS datasets. Please prepare the datasets and organize them as follows:
CLIPSelf/
├── data
│   ├── coco
│   │   ├── annotations
│   │   │   ├── instances_train2017.json  # the box annotations are not used
│   │   │   ├── panoptic_val2017.json
│   │   │   ├── panoptic_val2017          # panoptic masks
│   │   ├── train2017
│   │   ├── val2017
│   │   ├── coco_pseudo_4764.json         # to run RegionCLIP
│   │   ├── coco_proposals.json           # to run CLIPSelf with region proposals
│   ├── lvis_v1
│   │   ├── annotations
│   │   │   ├── lvis_v1_train.json        # the box annotations are not used
│   │   ├── train2017                     # the same as COCO
│   │   ├── val2017                       # the same as COCO
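Before launching training, it can help to verify that this layout is in place. The following sketch simply mirrors the tree above with standard-library checks; adjust `DATA_ROOT` if your data lives elsewhere.

```python
import os

# Paths mirror the directory tree above; change DATA_ROOT if needed.
DATA_ROOT = "data"
expected = [
    "coco/annotations/instances_train2017.json",
    "coco/annotations/panoptic_val2017.json",
    "coco/annotations/panoptic_val2017",      # directory of panoptic masks
    "coco/train2017",
    "coco/val2017",
    "lvis_v1/annotations/lvis_v1_train.json",
    "lvis_v1/train2017",
    "lvis_v1/val2017",
]

for rel in expected:
    path = os.path.join(DATA_ROOT, rel)
    status = "ok" if os.path.exists(path) else "MISSING"
    print(f"{status:7s} {path}")
```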
For CLIPSelf with region proposals, or for RegionCLIP which uses region-text pairs, obtain coco_pseudo_4764.json or coco_proposals.json from Drive and put the json files under data/coco.
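Once downloaded, the files can be checked for readability with a plain `json.load`; the sketch below only confirms they parse and does not assume anything about their internal schema, which is defined by the released files themselves.

```python
import json

# Quick parse check for the downloaded region-proposal / region-text files.
for name in ("data/coco/coco_pseudo_4764.json", "data/coco/coco_proposals.json"):
    try:
        with open(name) as f:
            obj = json.load(f)
        size = len(obj) if hasattr(obj, "__len__") else "n/a"
        print(f"{name}: type={type(obj).__name__}, top-level size={size}")
    except FileNotFoundError:
        print(f"{name}: not found (only needed for proposal-based / RegionCLIP runs)")
```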
To run CLIPSelf, first obtain the original models from EVA-02-CLIP and put them under checkpoints/ as follows:
CLIPSelf/
├── checkpoints
│   ├── EVA02_CLIP_B_psz16_s8B.pt
│   ├── EVA02_CLIP_L_336_psz14_s6B.pt
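To confirm the downloaded weights are intact before training, they can be opened with plain `torch.load`. The file name below follows the tree above; whether the EVA-02-CLIP release stores a raw state dict or wraps it under a key such as `state_dict` depends on the release, so the sketch handles both cases defensively.

```python
import torch

# Inspect a downloaded EVA-02-CLIP checkpoint without building the model.
ckpt_path = "checkpoints/EVA02_CLIP_B_psz16_s8B.pt"
ckpt = torch.load(ckpt_path, map_location="cpu")

# The checkpoint may be a raw state dict or a dict wrapping one.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"{len(state_dict)} entries, e.g.:")
for k in list(state_dict)[:5]:
    v = state_dict[k]
    print(" ", k, tuple(v.shape) if hasattr(v, "shape") else type(v).__name__)
```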
We provide the scripts to train CLIPSelf and RegionCLIP under scripts/; they are summarized as follows:
| # | Model | Method | Proposals | Training Data | Script | Checkpoint |
|---|----------|------------|-----------|---------------|--------|------------|
| 1 | ViT-B/16 | CLIPSelf | - | COCO | script | model |
| 2 | ViT-B/16 | CLIPSelf | + | COCO | script | model |
| 3 | ViT-B/16 | RegionCLIP | + | COCO | script | model |
| 4 | ViT-L/14 | CLIPSelf | - | COCO | script | model |
| 5 | ViT-L/14 | CLIPSelf | + | COCO | script | model |
| 6 | ViT-L/14 | RegionCLIP | + | COCO | script | model |
| 7 | ViT-B/16 | CLIPSelf | - | LVIS | script | model |
| 8 | ViT-L/14 | CLIPSelf | - | LVIS | script | model |
For example, to refine ViT-B/16 with CLIPSelf using only image patches on COCO (row 1 above), simply run:
bash scripts/train_clipself_coco_image_patches_eva_vitb16.sh # 1
We also provide the checkpoints of the experiments listed above in Drive. They can be organized as follows:
CLIPSelf/
├── checkpoints
│   ├── eva_vitb16_coco_clipself_patches.pt    # 1
│   ├── eva_vitb16_coco_clipself_proposals.pt  # 2
│   ├── eva_vitb16_coco_regionclip.pt          # 3
│   ├── eva_vitl14_coco_clipself_patches.pt    # 4
│   ├── eva_vitl14_coco_clipself_proposals.pt  # 5
│   ├── eva_vitl14_coco_regionclip.pt          # 6
│   ├── eva_vitb16_lvis_clipself_patches.pt    # 7
│   ├── eva_vitl14_lvis_clipself_patches.pt    # 8
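A refined checkpoint can be loaded back into its backbone for downstream use via OpenCLIP's factory functions. The sketch below is an assumption-laden example: the model name "EVA02-CLIP-B-16" follows EVA-CLIP's naming and may differ in this repo (check `open_clip.list_models()` for the exact identifier), and it relies on OpenCLIP accepting a local checkpoint path as the `pretrained` argument.

```python
import open_clip

# Sketch: rebuild the ViT-B/16 backbone and load a CLIPSelf-refined checkpoint.
# The model name is an assumption; verify it with open_clip.list_models().
model_name = "EVA02-CLIP-B-16"
ckpt = "checkpoints/eva_vitb16_coco_clipself_patches.pt"  # row 1 above

model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=ckpt)
tokenizer = open_clip.get_tokenizer(model_name)
model.eval()
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters loaded")
```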
To evaluate a ViT-B/16 model, run:
bash scripts/test_eva_vitb16_macc_boxes_masks.sh name_of_the_test path/to/checkpoint.pt
To evaluate a ViT-L/14 model, run:
bash scripts/test_eva_vitl14_macc_boxes_masks.sh name_of_the_test path/to/checkpoint.pt
Go to the folder CLIPSelf/F-ViT and follow the instructions in this README.
This project is licensed under NTU S-Lab License 1.0.
@article{wu2023clipself,
title={CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction},
author={Size Wu and Wenwei Zhang and Lumin Xu and Sheng Jin and Xiangtai Li and Wentao Liu and Chen Change Loy},
journal={arXiv preprint arXiv:2310.01403},
year={2023}
}
We thank OpenCLIP, EVA-CLIP and MMDetection for their valuable code bases.