This is the code for the paper DesCo: Learning Object Recognition with Rich Language Descriptions (NeurIPS 2023).
Check out the Hugging Face demo at link.
Environment
This repo requires PyTorch >= 1.9 and torchvision. We recommend using Docker to set up the environment. Depending on your GPU, you can use one of these pre-built Docker images:

```bash
# CUDA 10.2
docker pull pengchuanzhang/maskrcnn:ubuntu18-py3.7-cuda10.2-pytorch1.9
# CUDA 11.3
docker pull pengchuanzhang/pytorch:ubuntu20.04_torch1.9-cuda11.3-nccl2.9.9
```
Then install the following packages:
```bash
pip install einops shapely timm yacs tensorboardX ftfy prettytable pymongo sentence_transformers fastcluster openai transformers==4.11 wandb protobuf==3.20.1
python setup.py build develop --user
```
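To quickly check that the build succeeded, you can try importing the compiled package. This assumes it builds under the `maskrcnn_benchmark` namespace, as in the GLIP/FIBER codebases this repo builds on; adjust the module name if your build differs.

```bash
# Sanity check after building. The maskrcnn_benchmark module name is an assumption
# carried over from GLIP; change it if this repo builds under a different name.
python -c "import torch, maskrcnn_benchmark; print(torch.__version__, 'build OK')"
```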
| Model | LVIS MiniVal (AP) | OmniLabel (AP) | Config | Weight |
|---|---|---|---|---|
| DesCo-GLIP (Tiny) | 34.6 | 23.8 | config | weight |
| DesCo-FIBER (Base) | 39.5 [1] | 29.3 | config | weight |
[1] For DesCo-FIBER, we find it beneficial to stop training early for LVIS evaluation. Thus, we provide both the best checkpoint for LVIS evaluation and the final checkpoint.
Demo:

```bash
export GPUS=0
export CHECKPOINT=OUTPUTS/GLIP/desco_glip_tiny.pth
export CONFIG=configs/pretrain_new/desco_glip.yaml
CUDA_VISIBLE_DEVICES=$GPUS python tools/run_demo.py --config $CONFIG --weight $CHECKPOINT --image tools/pics/1.png --conf 0.5 --caption "a train besides sidewalk" --ground_tokens "train;sidewalk"
```
`--ground_tokens` specifies which tokens we wish the model to ground, separated by `;`. If it is not specified, the script will use NLTK to extract noun phrases from the caption automatically.
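For example, reusing the exports above and dropping `--ground_tokens` so that noun phrases are extracted automatically with NLTK:

```bash
# Without --ground_tokens, the script extracts noun phrases from the caption with NLTK.
CUDA_VISIBLE_DEVICES=$GPUS python tools/run_demo.py --config $CONFIG --weight $CHECKPOINT \
    --image tools/pics/1.png --conf 0.5 --caption "a train besides sidewalk"
```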
Below we provide scripts to pre-train DesCo-GLIP/FIBER. We used 8 A6000 GPUs for pre-training; the learning rate should be adjusted according to the batch size.
Pre-Training DesCo-GLIP:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
    tools/train_net.py \
    --config-file configs/pretrain_new/glip.yaml \
    --skip-test \
    --wandb_name GLIP \
    SOLVER.IMS_PER_BATCH 16 \
    SOLVER.BASE_LR 0.00005 \
    SOLVER.MAX_ITER 300000 \
    SOLVER.MAX_NEG_PER_BATCH 1.0
```
Pre-Training DesCo-FIBER:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
    tools/train_net.py \
    --config-file configs/pretrain_new/fiber.yaml \
    --skip-test \
    --wandb_name FIBER \
    SOLVER.IMS_PER_BATCH 16 \
    SOLVER.BASE_LR 0.00002 \
    SOLVER.MAX_ITER 200000 \
    MODEL.WEIGHT MODEL/fiber_coarse_then_fine.pth \
    SOLVER.MAX_NEG_PER_BATCH 1.0
```
Notes:
- In the current config, CC3M is not included. One could add `bing_caption_train_no_coco` to `DATA.TRAIN` to enable training on CC3M. Training without CC3M should give similar performance on OmniLabel and slightly worse performance on LVIS than reported in the paper.
- Checkpoints will be saved to `OUTPUTS/{wandb_name}`; `--use_wandb` can be specified in the arguments to enable logging to wandb.
- `SOLVER.MAX_NEG_PER_BATCH` needs to be set to `1.0` to enable training with full-negative prompts.
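As noted above, the learning rate should be adjusted according to the batch size. Below is a hedged sketch of a 4-GPU DesCo-GLIP run that halves both the batch size and the learning rate under a linear-scaling assumption; we have not verified this exact configuration:

```bash
# Hypothetical 4-GPU variant of the DesCo-GLIP pre-training command above.
# Batch size and learning rate are halved under a linear-scaling assumption; validate before use.
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
    tools/train_net.py \
    --config-file configs/pretrain_new/glip.yaml \
    --skip-test \
    --wandb_name GLIP \
    SOLVER.IMS_PER_BATCH 8 \
    SOLVER.BASE_LR 0.000025 \
    SOLVER.MAX_ITER 300000 \
    SOLVER.MAX_NEG_PER_BATCH 1.0
```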
Evaluation on LVIS:

```bash
export GPUS=0,1,2,3,4,5,6,7
export GPU_NUM=8
export CHECKPOINT=OUTPUTS/GLIP/desco_glip_tiny.pth
export MODEL_CONFIG=configs/pretrain_new/glip.yaml
CUDA_VISIBLE_DEVICES=${GPUS} python -m torch.distributed.launch --nproc_per_node=${GPU_NUM} \
    tools/test_grounding_net.py \
    --config-file ${MODEL_CONFIG} \
    --task_config configs/lvis/val.yaml \
    --weight ${CHECKPOINT} \
    OUTPUT_DIR OUTPUTS/GLIP \
    TEST.EVAL_TASK detection \
    TEST.CHUNKED_EVALUATION 8 TEST.IMS_PER_BATCH ${GPU_NUM} SOLVER.IMS_PER_BATCH ${GPU_NUM} TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 3000 MODEL.RETINANET.DETECTIONS_PER_IMG 300 MODEL.FCOS.DETECTIONS_PER_IMG 300 MODEL.ATSS.DETECTIONS_PER_IMG 300 MODEL.ROI_HEADS.DETECTIONS_PER_IMG 300 \
    DATASETS.OD_TO_GROUNDING_VERSION description.gpt.v10.infer.v1 \
    DATASETS.DESCRIPTION_FILE tools/files/lvis_v1.description.v1.json
```
Useful notes:
- `TEST.IMS_PER_BATCH` should be equal to `GPU_NUM`; the current evaluation script only supports inference on 1 image per GPU. A single-GPU sketch is given after these notes.
- Since there are over 1000 categories in LVIS, `TEST.CHUNKED_EVALUATION` specifies how many categories we put into one prompt; thus, the model needs to be run multiple times per image with different prompts. We recommend evaluating with 8 GPUs; evaluating on `minival` takes several hours.
- `DATASETS.OD_TO_GROUNDING_VERSION` specifies how we convert the category names into descriptions. It is used in `data/dataset/_od_to_description.py`.
- The default evaluation protocol uses the fixed AP.
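The LVIS evaluation command above passes `${GPU_NUM}` to both `TEST.IMS_PER_BATCH` and `SOLVER.IMS_PER_BATCH`, so the single-GPU sketch mentioned in the notes only needs different exports; expect it to take far longer than the recommended 8-GPU setup.

```bash
# Single-GPU LVIS evaluation: change only the exports and reuse the evaluation command above,
# which already ties TEST.IMS_PER_BATCH and SOLVER.IMS_PER_BATCH to ${GPU_NUM} (1 image per GPU).
export GPUS=0
export GPU_NUM=1
export CHECKPOINT=OUTPUTS/GLIP/desco_glip_tiny.pth
export MODEL_CONFIG=configs/pretrain_new/glip.yaml
```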
Evaluation on OmniLabel:

```bash
export GPUS=0
export GPU_NUM=1
export CHECKPOINT=OUTPUTS/GLIP/desco_glip_tiny.pth
export MODEL_CONFIG=configs/pretrain_new/glip.yaml
CUDA_VISIBLE_DEVICES=${GPUS} python -m torch.distributed.launch --nproc_per_node=${GPU_NUM} \
    tools/test_net_omnilabel.py \
    --config-file ${MODEL_CONFIG} \
    --weight ${CHECKPOINT} \
    --task_config configs/omnilabel/omnilabel_val_eval.yaml \
    --chunk_size 20 \
    OUTPUT_DIR OUTPUTS/GLIP \
    TEST.IMS_PER_BATCH ${GPU_NUM} \
    DATASETS.TEST "('omnilabel_val_coco',)"
```
- Supported evaluation datasets (set by `DATASETS.TEST`) include `omnilabel_val_o365`, `omnilabel_val_coco`, `omnilabel_val_oi_v5`, and `omnilabel_val`; an example of switching splits is shown below.
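For instance, to evaluate on the Objects365 portion of OmniLabel, switch `DATASETS.TEST`; this sketch assumes the same task config, `configs/omnilabel/omnilabel_val_eval.yaml`, applies to all splits.

```bash
# Same launch as above, with DATASETS.TEST switched to the Objects365 split of OmniLabel.
# We assume configs/omnilabel/omnilabel_val_eval.yaml covers all splits; adjust if needed.
CUDA_VISIBLE_DEVICES=${GPUS} python -m torch.distributed.launch --nproc_per_node=${GPU_NUM} \
    tools/test_net_omnilabel.py \
    --config-file ${MODEL_CONFIG} \
    --weight ${CHECKPOINT} \
    --task_config configs/omnilabel/omnilabel_val_eval.yaml \
    --chunk_size 20 \
    OUTPUT_DIR OUTPUTS/GLIP \
    TEST.IMS_PER_BATCH ${GPU_NUM} \
    DATASETS.TEST "('omnilabel_val_o365',)"
```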
Code structure:
- The core of DesCo is to construct the training prompts and maintain the correspondence between boxes and entities in the prompt; these functionalities are mostly implemented in `data/datasets/_caption_aug.py` and `data/dataset/_od_to_description.py`.
- We have implemented several different versions of the prompt construction process. They are controlled by the `OD_TO_GROUNDING_VERSION` (for detection data such as Objects365), `CAPTION_AUGMENTATION_VERSION` (for gold grounding data such as GoldG and Flickr30K), and `CC_CAPTION_AUGMENTATION_VERSION` (for web data such as CC3M) fields in the config.
- The rest of the code is similar to GLIP/FIBER. We made no changes to the architecture; thus, the weights are compatible with the GLIP/FIBER repos.