
DesCo

This is the code for the paper DesCo: Learning Object Recognition with Rich Language Descriptions (NeurIPS 2023).

Check out the Hugging Face demo at link.

Installation and Setup

Environment

This repo requires PyTorch>=1.9 and torchvision. We recommend using Docker to set up the environment. Depending on your GPU, you can use one of these pre-built images: docker pull pengchuanzhang/maskrcnn:ubuntu18-py3.7-cuda10.2-pytorch1.9 or docker pull pengchuanzhang/pytorch:ubuntu20.04_torch1.9-cuda11.3-nccl2.9.9.

Then install the following packages:

pip install einops shapely timm yacs tensorboardX ftfy prettytable pymongo sentence_transformers fastcluster openai transformers==4.11 wandb protobuf==3.20.1
python setup.py build develop --user
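
To verify the environment after installation, a minimal sanity check (assuming a CUDA-enabled PyTorch build; this is not part of the repo's tooling) is:

# Quick sanity check (not part of the repo); assumes a CUDA-enabled PyTorch build.
import torch
import torchvision

print("torch:", torch.__version__)              # expect >= 1.9
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())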

Models

Model                LVIS MiniVal (AP)   OmniLabel (AP)   Config   Weight
DesCo-GLIP (Tiny)    34.6                23.8             config   weight
DesCo-FIBER (Base)   39.5 [1]            29.3             config   weight

[1] For DesCo-FIBER, we find early stopping beneficial for LVIS evaluation. Thus we provide both the best checkpoint for LVIS evaluation and the final checkpoint.

Quick Start

export GPUS=0
export CHECKPOINT=OUTPUTS/GLIP/desco_glip_tiny.pth
export CONFIG=configs/pretrain_new/desco_glip.yaml

CUDA_VISIBLE_DEVICES=$GPUS python tools/run_demo.py --config $CONFIG --weight $CHECKPOINT --image tools/pics/1.png --conf 0.5 --caption "a train besides sidewalk" --ground_tokens "train;sidewalk"

ground_tokens specifies which tokens we wish the model to ground to, separated by ";". If it is not specified, the script uses NLTK to extract noun phrases from the caption automatically (a rough sketch of this fallback is shown below).
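
For reference, the NLTK fallback could look roughly like the following; this is only a sketch of the described behavior, not necessarily the exact logic in tools/run_demo.py.

# Rough sketch of the NLTK fallback described above; the actual logic in
# tools/run_demo.py may extract noun phrases differently.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def extract_noun_phrases(caption):
    tagged = nltk.pos_tag(nltk.word_tokenize(caption))
    # Simple NP chunking: optional determiner, adjectives, then one or more nouns.
    tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}").parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]

print(extract_noun_phrases("a train besides sidewalk"))  # e.g. ['a train', 'sidewalk']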

Pre-Training

Below we provide scripts to pre-train DesCo-GLIP/FIBER. We used 8 A6000 GPUs for pre-training; the learning rate should be adjusted according to the batch size (a rough linear-scaling sketch follows the DesCo-GLIP command below).

Pre-Training DesCo-GLIP:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
    tools/train_net.py \
    --config-file configs/pretrain_new/glip.yaml  \
    --skip-test \
    --wandb_name GLIP \
    SOLVER.IMS_PER_BATCH 16 \
    SOLVER.BASE_LR 0.00005 \
    SOLVER.MAX_ITER 300000 \
    SOLVER.MAX_NEG_PER_BATCH 1.0
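
As a concrete example of adjusting the learning rate with the batch size, a common linear-scaling rule of thumb (an assumption, not something prescribed by this repo) relative to the setting above:

# Rule-of-thumb linear scaling (an assumption, not a repo-prescribed formula):
# keep BASE_LR / IMS_PER_BATCH roughly constant relative to the setting above.
REF_BATCH, REF_LR = 16, 5e-5   # SOLVER.IMS_PER_BATCH and SOLVER.BASE_LR used above

def scaled_lr(ims_per_batch, ref_batch=REF_BATCH, ref_lr=REF_LR):
    return ref_lr * ims_per_batch / ref_batch

print(scaled_lr(32))  # 1e-4 when the total batch size is doubled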

Pre-Training DesCo-FIBER:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
    tools/train_net.py \
    --config-file configs/pretrain_new/fiber.yaml  \
    --skip-test \
    --wandb_name FIBER \
    SOLVER.IMS_PER_BATCH 16 \
    SOLVER.BASE_LR 0.00002 \
    SOLVER.MAX_ITER 200000 \
    MODEL.WEIGHT MODEL/fiber_coarse_then_fine.pth \
    SOLVER.MAX_NEG_PER_BATCH 1.0

Notes:

  • In the current config, CC3M is not included. One could add bing_caption_train_no_coco to DATA.TRAIN to enable training on CC3M. Compared to the numbers reported in the paper, training without CC3M should give similar performance on OmniLabel and slightly worse performance on LVIS.
  • Checkpoints will be saved to OUTPUTS/{wandb_name}; --use_wandb can be specified in the arguments to enable logging to wandb.
  • SOLVER.MAX_NEG_PER_BATCH needs to be set to 1.0 to enable training with full-negative prompts.

Evaluation on Benchmarks

LVIS

export GPUS=0,1,2,3,4,5,6,7
export GPU_NUM=8
export CHECKPOINT=OUTPUTS/GLIP/desco_glip_tiny.pth
export MODEL_CONFIG=configs/pretrain_new/glip.yaml

CUDA_VISIBLE_DEVICES=${GPUS} python -m torch.distributed.launch --nproc_per_node=${GPU_NUM} \
    tools/test_grounding_net.py \
    --config-file ${MODEL_CONFIG} \
    --task_config configs/lvis/val.yaml \
    --weight ${CHECKPOINT} \
    OUTPUT_DIR OUTPUTS/GLIP \
    TEST.EVAL_TASK detection \
    TEST.CHUNKED_EVALUATION 8 \
    TEST.IMS_PER_BATCH ${GPU_NUM} \
    SOLVER.IMS_PER_BATCH ${GPU_NUM} \
    TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 3000 \
    MODEL.RETINANET.DETECTIONS_PER_IMG 300 \
    MODEL.FCOS.DETECTIONS_PER_IMG 300 \
    MODEL.ATSS.DETECTIONS_PER_IMG 300 \
    MODEL.ROI_HEADS.DETECTIONS_PER_IMG 300 \
    DATASETS.OD_TO_GROUNDING_VERSION description.gpt.v10.infer.v1 \
    DATASETS.DESCRIPTION_FILE tools/files/lvis_v1.description.v1.json

Useful notes:

  • TEST.IMS_PER_BATCH should be equal to GPU_NUM; the current evaluation script only supports inference on 1 image per GPU.
  • Since LVIS has over 1,000 categories, they cannot all fit into a single prompt. TEST.CHUNKED_EVALUATION specifies how many categories are put into one prompt, so the model has to be run multiple times per image with different prompts (a toy sketch of the chunking follows this list). We recommend evaluating with 8 GPUs; evaluation on minival takes several hours.
  • DATASETS.OD_TO_GROUNDING_VERSION specifies how we convert the category names into descriptions. It is used in data/dataset/_od_to_description.py.
  • The default evaluation protocol uses fixed AP.
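
To make the chunked evaluation concrete, here is an illustrative sketch (not the repo's actual code) of splitting the LVIS categories into prompts of TEST.CHUNKED_EVALUATION categories each:

# Illustrative only: how ~1,200 LVIS category names might be split into
# prompt-sized chunks of TEST.CHUNKED_EVALUATION categories; the actual
# evaluation code may group categories differently.
def chunk_categories(categories, chunk_size=8):
    return [categories[i:i + chunk_size] for i in range(0, len(categories), chunk_size)]

lvis_categories = [f"category_{i}" for i in range(1203)]  # LVIS v1 has 1,203 categories
chunks = chunk_categories(lvis_categories, chunk_size=8)
print(len(chunks))  # ~151 prompts, so each image is run through the model ~151 times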

OmniLabel

export GPUS=0
export GPU_NUM=1
export CHECKPOINT=OUTPUTS/GLIP/desco_glip_tiny.pth
export MODEL_CONFIG=configs/pretrain_new/glip.yaml
export MODEL_NAME=GLIP

CUDA_VISIBLE_DEVICES=${GPUS} python -m torch.distributed.launch --nproc_per_node=${GPU_NUM} \
    tools/test_net_omnilabel.py \
    --config-file ${MODEL_CONFIG} \
    --weight ${CHECKPOINT} \
    --task_config configs/omnilabel/omnilabel_val_eval.yaml \
    --chunk_size 20 \
    OUTPUT_DIR OUTPUTS/${MODEL_NAME} \
    TEST.IMS_PER_BATCH ${GPU_NUM} \
    DATASETS.TEST "('omnilabel_val_coco',)"

  • Supported evaluation datasets (set by DATASETS.TEST) include omnilabel_val_o365, omnilabel_val_coco, omnilabel_val_oi_v5, and omnilabel_val.

Useful Notes

  • The core of DesCo is to construct the training prompts and maintain the correspondence between boxes and entities in the prompt; these functionalities are mostly implemented in data/datasets/_caption_aug.py and data/dataset/_od_to_description.py. A toy illustration of this correspondence appears at the end of this section.

  • We have implemented several different versions of the prompt construction process. They are controlled by the OD_TO_GROUNDING_VERSION (for detection data such as Objects365), CAPTION_AUGMENTATION_VERSION (for gold grounding data such as GoldG and Flickr30K), and CC_CAPTION_AUGMENTATION_VERSION (for web data such as CC3M) fields in the config.

  • The rest of the code is similar to GLIP/FIBER. We made no changes to the architecture; thus the weights are compatible with the GLIP/FIBER repos.
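
To make the first note above concrete, here is a hypothetical toy example of what "maintaining the correspondence between boxes and entities" means; the field names are illustrative only and do not reflect the repo's actual data structures.

# Hypothetical toy example of the box-entity correspondence mentioned above.
# The field names here are illustrative only; the real logic lives in
# data/datasets/_caption_aug.py and data/dataset/_od_to_description.py.
prompt = "a train besides sidewalk"   # reusing the Quick Start caption

# Each grounded entity records its character span in the prompt and the boxes
# (x1, y1, x2, y2) that should be aligned with it during training.
entities = [
    {"phrase": "train",    "char_span": (2, 7),   "boxes": [[120, 40, 480, 300]]},
    {"phrase": "sidewalk", "char_span": (16, 24), "boxes": [[0, 280, 640, 420]]},
]

# Prompt augmentation (adding descriptions, shuffling, inserting negatives) must
# keep these spans pointing at the right words, or the box-word alignment breaks.
for ent in entities:
    start, end = ent["char_span"]
    assert prompt[start:end] == ent["phrase"]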