This is the third party implementation of the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection by Zuwei Long and Wei Li.

You can use this code to fine-tune a model on your own dataset, or start pretraining a model from scratch.

Supported Features
Setup
Dataset
Config
Training
Results and Models
Inference
Acknowledgments
Citation
Contact

Supported Features

	Official release version	The version we replicated
Inference	✔	✔
Train (Objecet Detection data)	✖	✔
Train (Grounding data)	✖	✔
Slurm multi-machine support	✖	✔
Training acceleration strategy	✖	✔

Setup

We conduct our model testing using the following versions: Python 3.7.11, PyTorch 1.11.0, and CUDA 11.3. It is possible that other versions are also available.

Clone this repository.

git clone https://github.com/longzw1997/Open-GroundingDino.git && cd Open-GroundingDino/

Install the required dependencies.

pip install -r requirements.txt 
cd models/GroundingDINO/ops
python setup.py build install
python test.py
cd ../../..

Download pre-trained model and BERT weights, then modify the corresponding paths in the train/test script.

Dataset

For training, we use the odvg data format to support both OD data and VG data.
Before model training begins, you need to convert your dataset into odvg format, see data_format.md | datasets_mixed_odvg.json | coco2odvg.py | grit2odvg for more details.

For testing, we use coco format, which currently only supports OD datasets.

mixed dataset

{
  "train": [
    {
      "root": "path/V3Det/",
      "anno": "path/V3Det/annotations/v3det_2023_v1_all_odvg.jsonl",
      "label_map": "path/V3Det/annotations/v3det_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/LVIS/train2017/",
      "anno": "path/LVIS/annotations/lvis_v1_train_odvg.jsonl",
      "label_map": "path/LVIS/annotations/lvis_v1_train_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/Objects365/train/",
      "anno": "path/Objects365/objects365_train_odvg.json",
      "label_map": "path/Objects365/objects365_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/coco_2017/train2017/",
      "anno": "path/coco_2017/annotations/coco2017_train_odvg.jsonl",
      "label_map": "path/coco_2017/annotations/coco2017_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/GRIT-20M/data/",
      "anno": "path/GRIT-20M/anno/grit_odvg_620k.jsonl",
      "dataset_mode": "odvg"
    }, 
    {
      "root": "path/flickr30k/images/flickr30k_images/",
      "anno": "path/flickr30k/annotations/flickr30k_entities_odvg_158k.jsonl",
      "dataset_mode": "odvg"
    }
  ],
  "val": [
    {
      "root": "path/coco_2017/val2017",
      "anno": "config/instances_val2017.json",
      "label_map": null,
      "dataset_mode": "coco"
    }
  ]
}

example for odvg dataset

# For OD
{"filename": "000000391895.jpg", "height": 360, "width": 640, "detection": {"instances": [{"bbox": [359.17, 146.17, 471.62, 359.74], "label": 3, "category": "motorcycle"}, {"bbox": [339.88, 22.16, 493.76, 322.89], "label": 0, "category": "person"}, {"bbox": [471.64, 172.82, 507.56, 220.92], "label": 0, "category": "person"}, {"bbox": [486.01, 183.31, 516.64, 218.29], "label": 1, "category": "bicycle"}]}}
{"filename": "000000522418.jpg", "height": 480, "width": 640, "detection": {"instances": [{"bbox": [382.48, 0.0, 639.28, 474.31], "label": 0, "category": "person"}, {"bbox": [234.06, 406.61, 454.0, 449.28], "label": 43, "category": "knife"}, {"bbox": [0.0, 316.04, 406.65, 473.53], "label": 55, "category": "cake"}, {"bbox": [305.45, 172.05, 362.81, 249.35], "label": 71, "category": "sink"}]}}

# For VG
{"filename": "014127544.jpg", "height": 400, "width": 600, "grounding": {"caption": "Homemade Raw Organic Cream Cheese for less than half the price of store bought! It's super easy and only takes 2 ingredients!", "regions": [{"bbox": [5.98, 2.91, 599.5, 396.55], "phrase": "Homemade Raw Organic Cream Cheese"}]}}
{"filename": "012378809.jpg", "height": 252, "width": 450, "grounding": {"caption": "naive : Heart graphics in a notebook background", "regions": [{"bbox": [93.8, 47.59, 126.19, 77.01], "phrase": "Heart graphics"}, {"bbox": [2.49, 1.44, 448.74, 251.1], "phrase": "a notebook background"}]}}

Config

config/cfg_odvg.py                   # for backbone, batch size, LR, freeze layers, etc.
config/datasets_mixed_odvg.json      # support mixed dataset for both OD and VG

Training

Before starting the training, you need to modify the config/datasets_mixed_example.json according to data_format.md.
The evaluation code defaults to using coco_val2017 for evaluation. If you are evaluating with your own test set, you need to convert the test data to coco format (not the ovdg format in data_format.md), and modify the config to set use_coco_eval = False (The COCO dataset has 80 classes used for training but 90 categories in total, so there is a built-in mapping in the code). Also, update the label_list in the config with your own class names like label_list=['dog', 'cat', 'person'].

# train/eval on torch.distributed.launch:
bash train_dist.sh  ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}
bash test_dist.sh  ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}

# train/eval on slurm cluster：
bash train_slurm.sh  ${PARTITION} ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}
bash test_slurm.sh  ${PARTITION} ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}
# e.g.  check train_slurm.sh for more details
# bash train_slurm.sh v100_32g 32 config/cfg_odvg.py config/datasets_mixed_odvg.json ./logs
# bash train_slurm.sh v100_32g 8 config/cfg_coco.py config/datasets_od_example.json ./logs

Results and Models

Name	Pretrain data	Task	mAP on COCO	Ckpt	Misc
GroundingDINO-T (offical)	O365,GoldG,Cap4M	zero-shot	48.4 (zero-shot)	model	-
GroundingDINO-T (fine-tune)	O365,GoldG,Cap4M	finetune w/ coco	57.3 (fine-tune)	model	cfg \| log
GroundingDINO-T (pretrain)	COCO,O365,LIVS,V3Det, GRIT-200K,Flickr30k(total 1.8M)	zero-shot	55.1 (zero-shot)	model	cfg \| log

GRIT-200K generated by GLIP and spaCy.

Inference

Because the model architecture has not changed, you only need to install GroundingDINO library and then run inference_on_a_image.py to inference your images.

python tools/inference_on_a_image.py \
  -c tools/GroundingDINO_SwinT_OGC.py \
  -p path/to/your/ckpt.pth \
  -i ./figs/dog.jpeg \
  -t "dog" \
  -o output

Prompt	Official ckpt	COCO ckpt	1.8M ckpt
dog
cat

Acknowledgments

Provided codes were adapted from:

Citation

@misc{Open Grounding Dino,
  author = {Zuwei Long, Wei Li},
  title = {Open Grounding Dino:The third party implementation of the paper Grounding DINO},
  howpublished = {\url{https://github.com/longzw1997/Open-GroundingDino}},
  year = {2023}
}

Contact

longzuwei at sensetime.com
liwei1 at sensetime.com

Feel free to contact we if you have any suggestions or questions. Bugs found are also welcome. Please create a pull request if you find any bugs or want to contribute code.

June01/Open-GroundingDino