ZSD-YOLO is a zero-shot detector initially described in this paper based on the popular YOLOv5 detector that leverages CLIP vision language models to perform embedding alignment. Our paper describes a self-labeling method and modified nms operation tailored to the zero-shot detection (ZSD) problem also implemented in this repository.
ZSD-YOLO achieves state-of-the-art performance across three main ZSD benchmarks while also being significantly more lightweight.
Install
Python>=3.6.0 is required with all [requirements.txt] installed including PyTorch>=1.7:
$ git clone https://github.com/Johnathan-Xie/ZSD-YOLO
$ cd ZSD-YOLO
$ pip install -r requirements.txt
$ pip install git+https://github.com/openai/CLIP.git
Download
Datasets are setup in typical YOLOv5 format with paired image, label folders, and a .yaml metadata file, except in our repository labels are stored in torch .pt files to allow for offline image embedding computation.
All data, model_checkpoints, and embeddings can be found at this kaggle dataset link. Download and place the folders in their corresponding locations (your local repository layout should match the layout of the kaggle repository). Alternatively you can just download the entire repository from the kaggle link and follow the above install instructions for all the functionality.
COCO
Zero Shot Detection (ZSD) mAP
$ python3 test.py --weights weights/model_checkpoints/yolov5x_coco_65_15_zsd_self.pt --data data/coco/coco_zsd_2014_test_65_15.yaml --img 640 --save-json --zsd --annot-folder labels2014_zsd_self_test_l_65_15 --obj-conf-thresh 0.1 --batch-size 20 --iou-thres 0.4 --conf-thres 0.001 --max-det 15 --verbose --exist-ok --plot-conf 0.1
5s 48_17 48_17 48_17
5m
Generalized Zero Shot Detection (GZSD) mAP
$ python3 test.py --weights weights/model_checkpoints/yolov5x_coco_65_15_zsd_self.pt --data data/coco/coco_gzsd_2014_65_15.yaml --img 640 --save-json --zsd --annot-folder labels2014_gzsd_65_15 --obj-conf-thresh 0.1 --batch-size 20 --iou-thres 0.4 --conf-thres 0.001 --max-det 45 --text-embedding-path embeddings/all_coco_text_embeddings_65_15.pt --eval-splits unseen_names seen_names --verbose --exist-ok --plot-conf 0.1 --eval-by-splits
48_17 48_17 48_17 48_17
ILSVRC
$ python3 test.py --weights weights/model_checkpoints/yolov5x_ilsvrc_zsd_self.pt --data data/ILSVRC/ilsvrc_zsd_test.yaml --annot-folder ilsvrc_zsd_labels --zsd --batch-size 32 --verbose --max-det 2 --iou-thres 0.4 --obj-conf-thresh 0.2 --plot-conf 0.3 --img 320
VG
$ python3 test.py --weights weights/model_checkpoints/yolov5x_vg_zsd_self.pt --data data/vg/vg_zsd_test.yaml --annot-folder labels_zsd_test --zsd --batch-size 12 --verbose --max-det 100 --iou-thres 0.4 --obj-conf-thresh 0.01 --plot-conf 0.01
COCO
$ python3 train.py --data data/coco/coco_zsd_2014_val_65_15.yaml --weights weights/pretrained_weights/yolov5x_coco_65_15_pretrain.pt --cfg models/yolov5x-ZSD-65-15.yaml --hyp hyp.finetune_coco_zsd_self_evolved.yaml --annot-folder labels2014_zsd_self_val_l_65_15 --epochs 50 --zsd --batch-size 12 --verbose --max-det 15 --iou-thres 0.4 --workers 12 --obj-conf-thresh 0.1 --plot-conf 0.1
test_48_17.yaml 5m 5m 48-17 test_l_48_17
5s 5s
3 3
Note: For COCO 65/15 training, results will be shown for our proposed validation set. For true ZSD testing results, run the testing experiment with the test set shown above.
ILSVRC
$ python3 train.py --data data/ILSVRC/ilsvrc_zsd_test.yaml --weights weights/pretrained_weights/yolov5x_zsd_ilsvrc_pretrain.pt --cfg models/yolov5x-ZSD-ILSVRC.yaml --hyp hyp.finetune_coco_zsd_self_evolved.yaml --annot-folder ilsvrc_zsd_labels --epochs 50 --zsd --batch-size 32 --iou-thres 0.4 --workers 12 --obj-conf-thresh 0.2 --plot-conf 0.3 --img-size 320 --max-det 2
VG
$ python3 train.py --data data/vg/vg_zsd_test.yaml --weights weights/pretrained_weights/yolov5x_vg_pretrain.pt --cfg models/yolov5x-ZSD-VG.yaml --hyp hyp.finetune_vg_zsd_self.yaml --annot-folder labels_zsd_test --epochs 100 --zsd --batch-size 12 --verbose --max-det 100 --iou-thres 0.4 --workers 12 --obj-conf-thresh 0.01 --plot-conf 0.01
@inproceedings{xie2022zero,
title={Zero-shot object detection through vision-language embedding alignment},
author={Xie, Johnathan and Zheng, Shuai},
booktitle={2022 IEEE International Conference on Data Mining Workshops (ICDMW)},
pages={1--15},
year={2022},
organization={IEEE}
}