/locov

Localized Vision-Language Matching for Open-vocabulary Object Detection

Primary LanguagePythonApache License 2.0Apache-2.0

LocOV: Localized Vision-Language Matching for Open-vocabulary Object Detection

News

2022-07 (v0.1): This repository is the official PyTorch implementation of our GCPR 2022 paper: Localized Vision-Language Matching for Open-vocabulary Object Detection

Table of Contents

Installation

Requirements

  • Linux or macOS with Python ≥ 3.6
  • PyTorch ≥ 1.8. Install them together at pytorch.org to make sure of this. Note, please check the PyTorch version matches the one required by Detectron2 and your CUDA version.
  • Detectron2: follow Detectron2 installation instructions.

Originally the code was tested on python=3.8.13, torch=1.10.0, cuda=11.2 and OS Ubuntu 20.04.

git clone https://github.com/lmb-freiburg/locov.git
cd locov

Prepare datasets

Download datasets

  • Download MS COCO training and validation datasets. Download detection and caption annotations for retrieval from the original page.
  • Save the data in datasets_data
  • Run the script to create the annotation subsets that include only base and novel categories
python tools/convert_annotations_to_ov_sets.py

Precompute the text features

  • Run the script to save and calculate the object embeddings.
python tools/coco_bert_embeddings.py

Precomputed generic object proposals

  • Train OLN on MSCOCO known classes and extract the proposals for all the training set.
  • Or download the precomputed proposals for MSCOCO Train on known classes only Proposals (3.9GB)

Train and validate Open Vocabulary Detection

Model Outline

Method

Useful script commands

Train LSM stage

Run the script to train the Localized Semantic Matching stage

python train_ovnet.py --num-gpus 8 --resume --config-file configs/coco_lsm.yaml 

Train STT stage

Run the script to train the Localized Semantic Matching stage

python train_ovnet.py --num-gpus 8 --resume --config-file configs/coco_stt.yaml MODEL.WEIGHTS path_to_final_weights_lsm_stage

Evaluate

python train_ovnet.py --num-gpus 8 --resume --eval-only --config-file configs/coco_stt.yaml \
MODEL.WEIGHTS output/model-weights.pth \
OUTPUT_DIR output/eval_locov

Benchmark results

Models zoo

Pretrained models can be found in the models directory

Model AP-novel AP50-novel AP-known AP50-known AP-general AP50-general Weights
LocOv 17.219 30.109 33.499 53.383 28.129 45.719 LocOv

Acknowledgements

This work was supported by Deutscher Akademischer Austauschdienst - German Academic Exchange Service (DAAD) Research Grants - Doctoral Programmes in Germany, 2019/20; grant number: 57440921.

The Deep Learning Cluster used in this work is partially funded by the German Research Foundation (DFG) - 417962828.

We especially thank the creators of the following github repositories for providing helpful code:

  • Zareian et al. for their open-vocabulary setup and code: OVR-CNN

License

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

Citation

If you use our repository or find it useful in your research, please cite the following paper:

@InProceedings{Bravo2022locov,
  author       = "M. Bravo and S. Mittal and T. Brox",
  title        = "Localized Vision-Language Matching for Open-vocabulary Object Detection",
  booktitle    = "German Conference on Pattern Recognition (GCPR) 2022",
  year         = "2022"
}