/PromptDet

PromptDet: Towards Open-vocabulary Detection using Uncurated Images, ECCV2022

Primary LanguagePythonApache License 2.0Apache-2.0

PromptDet: Towards Open-vocabulary Detection using Uncurated Images (ECCV 2022)

Paper     Website

Introduction

The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where the class-agnostic object proposals are classified with a text encoder from pre-trained visual-language model; (ii) To pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to align the textual embedding space with regional visual object features; (iii) To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resource via a novel self-training framework, which allows to train the proposed detector on a large corpus of noisy uncurated web images. Lastly, (iv) to evaluate our proposed detector, termed as PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO dataset. PromptDet shows superior performance over existing approaches with fewer additional training images and zero manual annotations whatsoever.

Training framework

method overview

updates

  • July 20, 2022: add the code for LAION-novel and self-training
  • March 28, 2022: initial release

Prerequisites

  • MMDetection version 2.16.0.

  • Please see get_started.md for installation and the basic usage of MMDetection.

Regional Prompt Learning (RPL)

We learn the prompt vectors in an off-line manner using RPL. For your convenience, we also provide the learned prompt vectors and the category embeddings.

LAION-novel dataset

The LAION-novel dataset based on the learned category embeddings can be generated by using the PromptDet tools as follows:

# stege-I: install the dependencies, download the laion400m 64GB image.index and metadata.hdf5 (https://the-eye.eu/public/AI/cah/), and then retrival the LAION images (urls)
pip install faiss-cpu==1.7.2 img2dataset==1.12.0 fire==0.4.0 h5py==3.6.0
python tools/promptdet/retrieval_laion_image.py --indice-folder [laion400m-64GB-index] --metadata [metadata.hdf5] --text-features promptdet_resources/lvis_category_embeddings.pt --output-folder data/laion_lvis/images --num-images 500

# stege-II: download the LAION images
python tools/promptdet/download_laion_image.py --output-folder data/laion_lvis/images --num-thread 10

# stege-III: convert the LAION images to mmdetection format
python tools/promptdet/laion_dataset_converter.py --data-path data/laion_lvis/images --out-file data/laion_lvis/laion_train.json --topK 300

For your convenience, we also provide the image urls of our LAION-novel dataset.

Inference

# assume that you are under the root directory of this project,
# and you have activated your virtual environment if needed,
# and with LVIS v1.0 dataset in 'data/lvis_v1'.

./tools/dist_test.sh configs/promptdet/promptdet_r50_fpn_sample1e-3_mstrain_1x_lvis_v1_self_train.py work_dirs/promptdet_r50_fpn_sample1e-3_mstrain_1x_lvis_v1_self_train.pth 4 --eval bbox segm

Train

# download 'lvis_v1_train_seen.json' to 'data/lvis_v1/annotations'.

# train detector without self-training
./tools/dist_train.sh configs/promptdet/promptdet_r50_fpn_sample1e-3_mstrain_1x_lvis_v1.py 4

# train detector with self-training
./tools/dist_train.sh configs/promptdet/promptdet_r50_fpn_sample1e-3_mstrain_1x_lvis_v1_self_train.py 4

[0] Annotation file of base categories: lvis_v1_train_seen.json.
[1] Note that we provide a EpochPromptDetRunner to fetch the data from mutilple datasets alternately.

Models

For your convenience, we provide the following trained models (PromptDet) with mask AP.

Model RPL Self-training Epochs Scale Jitter Input Size APnovel APc APf AP Download
Baseline (manual prompt) 12 640~800 800x800 7.4 17.2 26.1 19.0 google
PromptDet_R_50_FPN_1x 12 640~800 800x800 11.5 19.4 26.7 20.9 google
PromptDet_R_50_FPN_1x 12 640~800 800x800 19.5 18.2 25.6 21.3 google
PromptDet_R_50_FPN_6x 72 100~1280 800x800 21.7 23.2 29.6 25.5 google

[0] All results are obtained with a single model and without any test time data augmentation such as multi-scale, flipping and etc..
[1] Refer to more details in config files in config/promptdet/.

Acknowledgement

Thanks MMDetection team for the wonderful open source project!

Citation

If you find PromptDet useful in your research, please consider citing:

@inproceedings{feng2022promptdet,
    title={PromptDet: Towards Open-vocabulary Detection using Uncurated Images},
    author={Feng, Chengjian and Zhong, Yujie and Jie, Zequn and Chu, Xiangxiang and Ren, Haibing and Wei, Xiaolin and Xie, Weidi and Ma, Lin},
    journal={Proceedings of the European Conference on Computer Vision},
    year={2022}
}