ViP-Object-Detection

Primary language: Python. License: Apache License 2.0.

Visual Parser: Representing Part-whole Hierarchies with Transformers

This repository contains the official implementation to reproduce object detection results of ViP. It is based on mmdetection.

Results and Models

Cascade Mask R-CNN

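| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViP-Ti | ImageNet-1K | 1x | 45.3 | 39.8 | 69.2M | 678G | config | Google Drive | Google Drive |
| ViP-S | ImageNet-1K | 1x | 48.0 | 42.0 | 87.1M | 725G | config | Google Drive | Google Drive |
| ViP-M | ImageNet-1K | 1x | 49.9 | 43.5 | 107.0M | 785G | - | - | Coming Soon |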

RetinaNet

| Backbone | Pretrain | Lr Schd | box mAP | #params | FLOPs | config | log | model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViP-Ti | ImageNet-1K | 1x | 39.9 | 21.4M | 181G | config | Google Drive | Google Drive |
| ViP-S | ImageNet-1K | 1x | 42.7 | 39.9M | 227G | config | Google Drive | Google Drive |
| ViP-S | ImageNet-1K | 3x | 43.9 | 39.9M | 227G | config | Google Drive | Google Drive |
| ViP-M | ImageNet-1K | 1x | 44.3 | 59.8M | 287G | - | - | Coming Soon |

Usage

Installation

Please refer to get_started.md for installation and dataset preparation.

Inference

# single-gpu testing (for box-only detectors such as RetinaNet, pass only `--eval bbox`)
python tools/test.py <CONFIG_FILE> <DET_CHECKPOINT_FILE> --eval bbox segm

# multi-gpu testing
tools/dist_test.sh <CONFIG_FILE> <DET_CHECKPOINT_FILE> <GPU_NUM> --eval bbox segm
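
When evaluating several checkpoints, it can help to assemble the test commands above programmatically. The following is a minimal sketch, not part of this repository: `build_test_command` and its parameters are hypothetical names, and the config and checkpoint paths in the usage line are placeholders. It assumes that `segm` should be requested only for detectors that predict masks (Cascade Mask R-CNN here), while box-only detectors such as RetinaNet are evaluated with `bbox` alone.

```python
def build_test_command(config, checkpoint, num_gpus=1, has_masks=True):
    """Return the argument list for evaluating one checkpoint.

    Hypothetical helper: it only mirrors the two commands shown above.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    # Box-only detectors produce no masks, so request bbox metrics only.
    metrics = ["bbox", "segm"] if has_masks else ["bbox"]

    if num_gpus == 1:
        # single-gpu testing
        launcher = ["python", "tools/test.py", config, checkpoint]
    else:
        # multi-gpu testing
        launcher = ["tools/dist_test.sh", config, checkpoint, str(num_gpus)]

    return launcher + ["--eval"] + metrics


# Placeholder paths; substitute a real config file and checkpoint.
print(" ".join(build_test_command("<CONFIG_FILE>", "<DET_CHECKPOINT_FILE>", num_gpus=8)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell-quoting issues with paths that contain spaces.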

Training

To train a detector with pre-trained models, run:

# single-gpu training
python tools/train.py <CONFIG_FILE>

# multi-gpu training
tools/dist_train.sh <CONFIG_FILE> <GPU_NUM>
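
The choice between the two training commands depends only on the number of GPUs, so a small launcher can pick the right one. This is a hedged sketch rather than a script shipped with the repository: `build_train_command` and `launch` are hypothetical names, and the `dry_run` switch is an illustrative convenience for inspecting the command before starting a long job.

```python
import shlex
import subprocess


def build_train_command(config, num_gpus=1):
    """Return the argument list for training with the given config."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if num_gpus == 1:
        # single-gpu training
        return ["python", "tools/train.py", config]
    # multi-gpu training
    return ["tools/dist_train.sh", config, str(num_gpus)]


def launch(config, num_gpus=1, dry_run=True):
    """Run the training command, or return it as a string when dry_run is set."""
    cmd = build_train_command(config, num_gpus)
    if dry_run:
        return shlex.join(cmd)
    # check=True raises CalledProcessError if training exits with an error.
    subprocess.run(cmd, check=True)
    return shlex.join(cmd)


# Placeholder config path; nothing is executed in dry-run mode.
print(launch("<CONFIG_FILE>", num_gpus=8))
```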

Citing ViP

@article{sun2021visual,
  title={Visual Parser: Representing Part-whole Hierarchies with Transformers},
  author={Sun, Shuyang and Yue, Xiaoyu and Bai, Song and Torr, Philip},
  journal={arXiv preprint arXiv:2107.05790},
  year={2021}
}