CAPE: Camera View Position Embedding for Multi-View 3D Object Detection (CVPR2023)

This repository is the official implementation of CAPE.

CAPE is a simple yet effective method for multi-view 3D object detection. It forms the 3D position embedding in the local camera-view coordinate system rather than in the global coordinate system, which largely reduces the difficulty of learning the view transformation. CAPE also supports temporal modeling by fusing the separate queries of multiple frames.
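To make the distinction between the two coordinate systems concrete, the sketch below expresses one 3D point, given in a global frame, in a single camera's local frame. It is a toy illustration and not the CAPE code: the function name `global_to_camera`, the example matrix, and the point are all hypothetical, and the only assumption is that each camera comes with a rigid 4x4 camera-to-global transform.

```python
# Toy illustration only (not the CAPE implementation): express a 3D point
# given in a global frame in one camera's local frame.
# The function name and all values below are hypothetical.

def global_to_camera(point_global, cam_to_global):
    """Apply the inverse of a rigid 4x4 camera-to-global transform to a 3D point."""
    # Split the rigid transform into rotation R (3x3) and translation t (3,).
    R = [row[:3] for row in cam_to_global[:3]]
    t = [row[3] for row in cam_to_global[:3]]
    # Inverse of a rigid transform: p_cam = R^T (p_global - t).
    d = [point_global[i] - t[i] for i in range(3)]
    return [sum(R[i][j] * d[i] for i in range(3)) for j in range(3)]


# Hypothetical camera: located at (1, 0, 0) in the global frame and
# rotated by 90 degrees about the z-axis.
CAM_TO_GLOBAL = [
    [0, -1, 0, 1],
    [1,  0, 0, 0],
    [0,  0, 1, 0],
    [0,  0, 0, 1],
]

if __name__ == "__main__":
    print(global_to_camera([1, 2, 0], CAM_TO_GLOBAL))
```

The same global point gets different local coordinates in each camera, which is why an embedding built in the camera-view system does not have to encode the camera extrinsics.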

Preparation

This implementation is built upon PETR and can be set up by following install.md.

Environments
Linux, Python == 3.7.9, CUDA == 11.2, PyTorch == 1.9.1, mmdet3d == 0.17.1
Detection Data
Follow the mmdet3d instructions to process the nuScenes dataset (https://github.com/open-mmlab/mmdetection3d/blob/master/docs/en/data_preparation.md).
Pretrained weights
To verify the performance on the val set, we provide the pretrained V2-99 weights. V2-99 is pretrained on DDAD15M (weights) and further trained on the nuScenes train set with FCOS3D. For the test-set results in the paper, we use the DD3D pretrained weights. The ImageNet pretrained weights of the other backbones can be found here. Please put the pretrained weights into ./ckpts/.

After preparation, you will be able to see the following directory structure:

CAPE
├── mmdetection3d
├── projects
│   ├── configs
│   ├── mmdet3d_plugin
├── tools
├── data
│   ├── nuscenes
│   │   ├── samples
│   │   ├── ...
├── ckpts
├── README.md
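A quick way to confirm the layout before training is a small check like the one below. This is a sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the list only contains the entries shown in the tree above.

```python
from pathlib import Path

# Entries copied from the directory tree above; the helper itself is illustrative.
EXPECTED = [
    "mmdetection3d",
    "projects/configs",
    "projects/mmdet3d_plugin",
    "tools",
    "data/nuscenes/samples",
    "ckpts",
    "README.md",
]


def missing_paths(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]


if __name__ == "__main__":
    # Run from inside the CAPE directory.
    missing = missing_paths(".")
    print("layout OK" if not missing else "missing: " + ", ".join(missing))
```

Run it from the CAPE root; any entry it reports as missing corresponds to a preparation step that has not been completed yet.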

Train & inference

cd CAPE

You can train the model as follows:

sh train.sh

You can evaluate the model as follows:

sh test.sh

Main Results

model	mAP	NDS	config	download
cape_r50_1408x512_24ep_wocbgs_imagenet_pretrain	34.7%	40.6%	config	log / checkpoint
capet_r50_704x256_24ep_wocbgs_imagenet_pretrain	31.8%	44.2%	config	log / checkpoint
capet_VoV99_800x320_24ep_wocbgs_load_dd3d_pretrain	44.7%	54.36%	config	log / checkpoint

Acknowledgement

Many thanks to the authors of mmdetection3d. Special thanks to the authors of PETR.

Citation

If you find this project useful for your research, please consider citing:

@inproceedings{Xiong2023CAPE,
  title={CAPE: Camera View Position Embedding for Multi-View 3D Object Detection},
  author={Xiong, Kaixin and Gong, Shi and Ye, Xiaoqing and Tan, Xiao and Wan, Ji and Ding, Errui and Wang, Jingdong and Bai, Xiang},
  booktitle={Computer Vision and Pattern Recognition},
  year={2023}
}

Contact

If you have any questions, feel free to open an issue or contact us at kaixinxiong@hust.edu.cn, gongshi@baidu.com, or yexiaoqing@baidu.com.
