04/14/2022: GLIP has been accepted to CVPR 2022 as an oral presentation! First version of code and pre-trained models are released!
12/06/2021: GLIP paper on arxiv https://arxiv.org/abs/2112.03857.
11/23/2021: Project page built.
This repository is the project page for GLIP. GLIP demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks.
- When directly evaluated on COCO and LVIS (without seeing any images in COCO), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines.
- After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA.
- When transferred to 13 downstream object detection tasks, a few-shot GLIP rivals with a fully-supervised Dynamic Head.
We provide code to:
- pre-train GLIP on detection and grounding data;
- zero-shot evaluate GLIP on standard benchmarks (COCO, LVIS, Flickr30K) and custom COCO-formated datasets;
- fine-tune GLIP on standard benchmarks (COCO) and custom COCO-formated datasets.
Please see respective sections for instructions.
Environment
This repo requires Pytorch>=1.9 and torchvision. We recommand using docker to setup the environment. You can use this pre-built docker image docker pull pengchuanzhang/maskrcnn:ubuntu18-py3.7-cuda10.2-pytorch1.9
or this one docker pull pengchuanzhang/pytorch:ubuntu20.04_torch1.9-cuda11.3-nccl2.9.9
depending on your GPU.
Then install the following packages:
pip install einops shapely timm yacs tensorboardX ftfy prettytable pymongo
pip install transformers
python setup.py build develop --user
Backbone Checkpoints. Download the ImageNet pre-trained backbone checkpoints into the MODEL
folder.
mkdir MODEL
wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/swin_base_patch4_window7_224.pth -O MODEL/swin_tiny_patch4_window7_224.pth
wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/swin_large_patch4_window12_384_22k.pth -O MODEL/swin_large_patch4_window12_384_22k.pth
Model | COCO [1] | LVIS [2] | LVIS [3] | ODinW [4] | Pre-Train Data | Config | Weight |
---|---|---|---|---|---|---|---|
GLIP-T (C) | 46.7 / 55.1 | 14.3 | 17.7 | 44.4 | Objects365,GoldG | config | weight |
GLIP-T [5] | 46.6 / 55.2 | 17.6 | 20.1 | 42.7 | Objects365,GoldG,CC3M,SBU | config [6] | weight |
GLIP-L [7] | 51.4 / 61.7 [8] | 29.3 | 30.1 | 51.2 | FourODs,GoldG,CC3M+12M,SBU | config [9] | weight |
[1] Zero-shot and fine-tuning performance on COCO val2017.
[2] Zero-shot performance on LVIS minival (APr) with the last pre-trained checkpoint.
[3] On LVIS, the model could overfit slightly during the pre-training course. Thus we reported two numbers on LVIS: the performance of the last checkpoint (LVIS[2]) and the performance of the best checkpoint during the pre-training course (LVIS[3]).
[4] Zero-shot performance on the 13 ODinW datasets.
[5] GLIP-T released in this repo is pre-trained on Conceptual Captions 3M and SBU captions. It is referred in paper in Table 1 and in Appendix C.3. It differs slightly from the GLIP-T in the main paper in terms of downstream performance. We will release the pre-training support for using CC3M and SBU captions data in the next update.
[6] This config is only intended for zero-shot evaluation and fine-tuning. Pre-training config with support for using CC3M and SBU captions data will be updated.
[7] GLIP-L released in this repo is pre-trained on Conceptual Captions 3M+12M and SBU captions. It slightly outperforms the GLIP-L in the main paper because the model used to annotate the caption data are improved compared to the main paper. We will release the pre-training support for using CC3M+12M and SBU captions data in the next update.
[8] Multi-scale testing used.
[9] This config is only intended for zero-shot evaluation and fine-tuning. Pre-training config with support for using CC3M+12M and SBU captions data to be updated.
Required Data. Prepare Objects365
, Flickr30K
, and MixedGrounding
data as in DATA.md. Support for training using caption data (Conceptual Captions and SBU captions) will be released soon.
Command.
Perform pre-training with the following command (please change the config-file accordingly; checkout model zoo for the corresponding config; change the {output_dir}
to your desired output directory):
python -m torch.distributed.launch --nnodes 2 --nproc_per_node=16 tools/train_net.py \
--config-file configs/pretrain/glip_Swin_T_O365_GoldG.yaml \
--skip-test --use-tensorboard --override_output_dir {output_dir}
For training GLIP-T models, we used nnodes = 2
, nproc_per_node=16
on 32GB V100 machines. For training GLIP-L models, we used nnodes = 4
, nproc_per_node=16
on 32GB V100 machines. Please setup the environment accordingly based on your local machine.
Prepare COCO/val2017
data as in DATA.md. Set {config_file}
, {model_checkpoint}
according to the Model Zoo
; set {output_dir}
to a folder where the evaluation results will be stored.
python tools/test_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \
TEST.IMS_PER_BATCH 1 \
MODEL.DYHEAD.SCORE_AGG "MEAN" \
TEST.EVAL_TASK detection \
MODEL.DYHEAD.FUSE_CONFIG.MLM_LOSS False \
OUTPUT_DIR {output_dir}
We follow MDETR to evaluate with the FixedAP criterion. Set {config_file}
, {model_checkpoint}
according to the Model Zoo
. Prepare COCO/val2017
data as in DATA.md.
python -m torch.distributed.launch --nproc_per_node=4 \
tools/test_grounding_net.py \
--config-file {config_file} \
--task_config configs/lvis/minival.yaml \
--weight {model_checkpoint} \
TEST.EVAL_TASK detection OUTPUT_DIR {output_dir}
TEST.CHUNKED_EVALUATION 40 TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 3000 MODEL.RETINANET.DETECTIONS_PER_IMG 300 MODEL.FCOS.DETECTIONS_PER_IMG 300 MODEL.ATSS.DETECTIONS_PER_IMG 300 MODEL.ROI_HEADS.DETECTIONS_PER_IMG 300
If you wish to evaluate on Val 1.0, set --task_config
to configs/lvis/val.yaml
.
GLIP supports easy evaluation on a custom dataset. Currently, the code supports evaluation on COCO-formatted dataset.
We will use the Aquarium dataset from ODinW as an example to show how to evaluate on a custom COCO-formatted dataset.
-
Download the raw dataset from RoboFlow in the COCO format into
DATASET/odinw/Aquarium
. Each train/val/test split has a correspondingannotation
file and aimage
folder. -
Remove the background class from the annotation file. This can be as simple as open "_annotations.coco.json" and remove the entry with "id:0" from "categories". For convenience, we provide the modified annotation files for Aquarium:
wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/odinw/Aquarium/Aquarium%20Combined.v2-raw-1024.coco/test/annotations_without_background.json -O DATASET/odinw/Aquarium/Aquarium\ Combined.v2-raw-1024.coco/test/annotations_without_background.json wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/odinw/Aquarium/Aquarium%20Combined.v2-raw-1024.coco/train/annotations_without_background.json -O DATASET/odinw/Aquarium/Aquarium\ Combined.v2-raw-1024.coco/train/annotations_without_background.json wget https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/odinw/Aquarium/Aquarium%20Combined.v2-raw-1024.coco/valid/annotations_without_background.json -O DATASET/odinw/Aquarium/Aquarium\ Combined.v2-raw-1024.coco/valid/annotations_without_background.json
-
Then create a yaml file as in
configs/odinw/Aquarium_Aquarium_Combined.v2-raw-1024.coco.yaml
. A few fields to be noted in the yamls:DATASET.CAPTION_PROMPT allows manually changing the prompt (the default prompt is simply concatnating all the categories);
MODELS.*.NUM_CLASSES need to be set to the number of categories in the dataset (including the background class). E.g., Aquarium has 7 non-background categories thus MODELS.*.NUM_CLASSES is set to 8;
-
Run the following command to evaluate on the dataset. Set
{config_file}
,{model_checkpoint}
according to theModel Zoo
. Set {odinw_configs} to the path of the task yaml file we just prepared.
python tools/test_grounding_net.py --config-file {config_file} --weight {model_checkpoint} \
--task_config {odinw_configs} \
TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 \
TEST.EVAL_TASK detection \
DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding \
DATALOADER.DISTRIBUTE_CHUNK_AMONG_NODE False \
DATASETS.USE_OVERRIDE_CATEGORY True \
DATASETS.USE_CAPTION_PROMPT True
Prepare Flickr30K
data as in DATA.md. Set {config_file}
, {model_checkpoint}
according to the Model Zoo
.
python tools/test_grounding_net.py \
--config-file {config_file} \
--task_config configs/flickr/test.yaml,configs/flickr/val.yaml \
--weight {model_checkpoint} \
OUTPUT_DIR {output_dir} TEST.IMS_PER_BATCH 1 SOLVER.IMS_PER_BATCH 1 TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 100 TEST.EVAL_TASK grounding MODEL.DYHEAD.FUSE_CONFIG.MLM_LOSS False
Prepare the COCO
data as in DATA.md. Set {config_file}
, {model_checkpoint}
according to the Model Zoo
.
Below is the fine-tuning script for tuning the Tiny models:
python -m torch.distributed.launch --nproc_per_node=16 tools/train_net.py \
--config-file {config_file} \
--skip-test \
MODEL.WEIGHT {model_checkpoint} \
DATASETS.TRAIN '("coco_grounding_train", )' \
MODEL.BACKBONE.FREEZE_CONV_BODY_AT -1 SOLVER.IMS_PER_BATCH 32 SOLVER.USE_AMP True SOLVER.MAX_EPOCH 24 TEST.DURING_TRAINING False TEST.IMS_PER_BATCH 16 SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.BASE_LR 0.00001 SOLVER.LANG_LR 0.00001 SOLVER.STEPS \(0.67,0.89\) DATASETS.DISABLE_SHUFFLE True MODEL.DYHEAD.SCORE_AGG "MEAN" TEST.EVAL_TASK detection
For evaluation, please follow the instructions in COCO Evaluation
. Scripts for tuning the Large model will be released soon.
Prepare the dataset as in ODinW / Custom Dataset Evaluation
.
For tuning with 1/3/5/10-shot, set {custom_shot_and_epoch_and_general_copy} to "1_200_8", "3_200_4", "5_200_2", "10_200_1", respectively.
For tuning with all the data, set {custom_shot_and_epoch_and_general_copy} to "0_200_1"; set SOLVER.STEP_PATIENCE to 2; set SOLVER.AUTO_TERMINATE_PATIENCE to 4.
python -m torch.distributed.launch --nproc_per_node=4 tools/finetune.py \
--config-file {config_file} --ft-tasks {configs} --skip-test \
--custom_shot_and_epoch_and_general_copy {custom_shot_and_epoch_and_general_copy} \
--evaluate_only_best_on_test --push_both_val_and_test \
MODEL.WEIGHT {model_checkpoint} \
SOLVER.USE_AMP True TEST.DURING_TRAINING True TEST.IMS_PER_BATCH 4 SOLVER.IMS_PER_BATCH 4 SOLVER.WEIGHT_DECAY 0.05 TEST.EVAL_TASK detection DATASETS.TRAIN_DATASETNAME_SUFFIX _grounding MODEL.BACKBONE.FREEZE_CONV_BODY_AT 2 MODEL.DYHEAD.USE_CHECKPOINT True SOLVER.FIND_UNUSED_PARAMETERS False SOLVER.TEST_WITH_INFERENCE True SOLVER.USE_AUTOSTEP True DATASETS.USE_OVERRIDE_CATEGORY True SOLVER.SEED 10 DATASETS.SHUFFLE_SEED 3 DATASETS.USE_CAPTION_PROMPT True DATASETS.DISABLE_SHUFFLE True \
SOLVER.STEP_PATIENCE 3 SOLVER.CHECKPOINT_PER_EPOCH 1.0 SOLVER.AUTO_TERMINATE_PATIENCE 8 SOLVER.MODEL_EMA 0.0 SOLVER.TUNING_HIGHLEVEL_OVERRIDE full
Follow the command as in Full Model Fine-Tuning
. But set the following hyper-parameters:
SOLVER.WEIGHT_DECAY 0.25 \
SOLVER.BASE_LR 0.05 \
SOLVER.TUNING_HIGHLEVEL_OVERRIDE language_prompt_v2
Please consider citing this paper if you use the code:
@inproceedings{li2021grounded,
title={Grounded Language-Image Pre-training},
author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
year={2022},
booktitle={CVPR},
}