[arxiv 2023] Multi-camera 3D Occupancy Prediction for Autonomous Driving

SurroundOcc


SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving
Yi Wei*, Linqing Zhao*, Wenzhao Zheng, Zheng Zhu, Jiwen Lu, Jie Zhou

News

  • [2023/3/21]: Support for private data. You can try both the occupancy prediction method and the ground truth generation pipeline on your own data.
  • [2023/3/17]: Initial code and paper release.
  • [2023/2/27]: Demo release.

Demo

The demos are fairly large, so please allow a moment for them to load. If they fail to load or look blurry, click the hyperlink of each demo for the full-resolution raw video. See the home page for more demos and detailed introductions.

Introduction

Towards a more comprehensive and consistent scene reconstruction, we propose SurroundOcc, a method that predicts volumetric occupancy from multi-camera images. We first extract multi-scale features for each image and adopt spatial cross attention to lift them to the 3D volume space. Then we apply 3D convolutions to progressively upsample the volume features and impose supervision on multiple levels. To train the multi-camera 3D scene reconstruction model, we design a pipeline that generates dense occupancy ground truth from sparse LiDAR points. The generation pipeline only needs existing 3D detection and 3D semantic segmentation labels, without extra human annotations. Specifically, we fuse multi-frame LiDAR points of dynamic objects and static scenes separately. Then we adopt Poisson Reconstruction to fill the holes and voxelize the mesh to get dense volumetric occupancy.

Method

Method Pipeline:

Occupancy Ground Truth Generation Pipeline:

Getting Started

You can download our pretrained models for the 3D semantic occupancy prediction and 3D scene reconstruction tasks. The difference is whether semantic labels are used to train the model. The models were trained on 8 RTX 3090s for about 2.5 days.

Try your own data

Occupancy prediction

You can try our nuScenes pretrained model on your own data! Here we give a template of in-the-wild data and its pickle file. Place it in ./data and change the corresponding infos. Specifically, set 'lidar2img', 'intrinsic' and 'data_path' to the extrinsic matrix, intrinsic matrix and path of your multi-camera images. Note that the frames should be ordered by their timestamps. 'occ_path' in this pickle file indicates the save path, and you will get raw results (.npy) and point clouds (.ply) in './visual_dir' for further visualization. You can use MeshLab to visualize the .ply files directly, or run tools/visual.py to visualize the .npy files.

./tools/dist_inference.sh ./projects/configs/surroundocc/surroundocc_inference.py ./path/to/ckpts.pth 8
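
As an illustration of the calibration fields mentioned above, here is a minimal sketch of how a LiDAR-to-image projection is commonly composed from a 3x3 intrinsic matrix and a 4x4 LiDAR-to-camera extrinsic. The function names and the example camera values are hypothetical, and whether the template pickle expects this product or the bare extrinsic in 'lidar2img' is an assumption you should verify against the provided template.

```python
import numpy as np

def make_lidar2img(intrinsic, lidar2cam):
    """Compose a 4x4 projection from a 3x3 intrinsic and a 4x4 LiDAR-to-camera extrinsic."""
    viewpad = np.eye(4)
    viewpad[:3, :3] = intrinsic  # pad the intrinsic to 4x4
    return viewpad @ lidar2cam

def project(lidar2img, points):
    """Project (N, 3) LiDAR points; returns (N, 2) pixel coordinates and (N,) depths."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = homo @ lidar2img.T
    depth = cam[:, 2]
    return cam[:, :2] / depth[:, None], depth

# Hypothetical camera: focal length 1000 px, principal point (800, 450),
# and a camera frame that coincides with the LiDAR frame (identity extrinsic).
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
lidar2cam = np.eye(4)
lidar2img = make_lidar2img(K, lidar2cam)
pixels, depth = project(lidar2img, np.array([[1.0, 0.5, 10.0]]))
```

Projecting a few known LiDAR points this way and checking that they land on the right pixels is a quick sanity check before running inference.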

Ground truth generation

You can also generate dense occupancy labels with your own data! We provide highly extensible code for this, together with an example sequence. Prepare your data like this:

your_own_data_folder/
├── pc/
│   ├── pc0.npy
│   ├── pc1.npy
│   ├── ...
├── bbox/
│   ├── bbox0.npy (bounding box of the object)
│   ├── bbox1.npy
│   ├── ...
│   ├── object_category0.npy (semantic category of the object)
│   ├── object_category1.npy
│   ├── ...
│   ├── boxes_token0.npy (Unique bbox codes used to combine the same object in different frames)
│   ├── boxes_token1.npy
│   ├── ...
├── calib/
│   ├── lidar_calibrated_sensor0.npy
│   ├── lidar_calibrated_sensor1.npy
│   ├── ...
├── pose/
│   ├── lidar_ego_pose0.npy
│   ├── lidar_ego_pose1.npy
│   ├── ...
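
Before running the generation script, it can help to check that every frame has all of its files. The following is a small sketch of such a check, not part of the repository: the helper name `missing_files` is hypothetical, and it assumes frames are numbered consecutively from 0, as the listing above suggests.

```python
from pathlib import Path

# Sub-folder -> file-name prefixes, taken from the layout above.
EXPECTED = {
    "pc": ["pc"],
    "bbox": ["bbox", "object_category", "boxes_token"],
    "calib": ["lidar_calibrated_sensor"],
    "pose": ["lidar_ego_pose"],
}

def missing_files(root, num_frames):
    """Return relative paths of expected per-frame .npy files that are absent under root."""
    root = Path(root)
    missing = []
    for folder, prefixes in EXPECTED.items():
        for prefix in prefixes:
            for i in range(num_frames):
                rel = Path(folder) / f"{prefix}{i}.npy"
                if not (root / rel).is_file():
                    missing.append(rel.as_posix())
    return missing
```

For example, `missing_files("your_own_data_folder", num_frames=40)` returns an empty list when the sequence is complete.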

You can generate occupancy labels with or without semantics (by enabling --with_semantic). If your LiDAR is high-resolution, e.g. RS128, LiVOX and M1, you can skip the Poisson reconstruction step, and the generation process will be very fast! You can change the point cloud range and voxel size in config.yaml, and you can use multithreading to speed up the generation process.

cd $Home/tools/generate_occupancy_nuscenes
python process_your_own_data.py --to_mesh --with_semantic --data_path $your_own_data_folder$ --len_sequence $frame number$
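
To see how the point cloud range and voxel size together determine the label grid, here is a minimal NumPy sketch of voxelizing points into a boolean occupancy volume. It is an illustration under assumptions, not the repository's implementation: the names `voxelize`, `pc_range` and `voxel_size` are hypothetical and may not match the keys in config.yaml.

```python
import numpy as np

def voxelize(points, pc_range, voxel_size):
    """Turn (N, 3) points into a boolean occupancy grid.

    pc_range is [x_min, y_min, z_min, x_max, y_max, z_max];
    voxel_size is the voxel edge length in the same unit as the points.
    """
    pc_range = np.asarray(pc_range, dtype=np.float64)
    lower, upper = pc_range[:3], pc_range[3:]
    grid_shape = np.round((upper - lower) / voxel_size).astype(int)
    # Keep only points inside the range (upper bound exclusive).
    inside = np.all((points >= lower) & (points < upper), axis=1)
    idx = np.floor((points[inside] - lower) / voxel_size).astype(int)
    idx = np.minimum(idx, grid_shape - 1)  # guard against rounding at the upper edge
    occupancy = np.zeros(grid_shape, dtype=bool)
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return occupancy

points = np.array([[0.1, 0.1, 0.1],    # falls in voxel (4, 4, 0)
                   [0.2, 0.2, 0.2],    # same voxel as the first point
                   [10.0, 0.0, 0.0],   # outside the range, dropped
                   [-2.0, -2.0, 0.0]]) # lower corner, voxel (0, 0, 0)
occ = voxelize(points, pc_range=[-2, -2, 0, 2, 2, 2], voxel_size=0.5)
```

Halving the voxel size doubles the grid resolution along each axis, so memory grows by a factor of eight.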

You can use --whole_scene_to_mesh to generate a complete static scene from all frames at once, then add the moving-object point clouds, and finally divide the result into small scenes. This accelerates the generation process and yields denser but more uneven occupancy labels.
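
The idea of building one static scene from all frames and then cutting it back into per-frame scenes can be sketched as follows. This is a simplified illustration that assumes each frame comes with a 4x4 homogeneous pose mapping its local coordinates to a shared world frame. The function names are hypothetical, and the actual script also handles moving objects and meshing, which are omitted here.

```python
import numpy as np

def to_world(points, pose):
    """Apply a 4x4 homogeneous pose to (N, 3) points."""
    return points @ pose[:3, :3].T + pose[:3, 3]

def fuse_static_scene(frames, poses):
    """Concatenate per-frame point clouds after moving each into the world frame."""
    return np.concatenate([to_world(p, T) for p, T in zip(frames, poses)], axis=0)

def to_frame(points_world, pose):
    """Inverse of to_world: express world points in one frame's local coordinates."""
    R, t = pose[:3, :3], pose[:3, 3]
    return (points_world - t) @ R

# Two hypothetical frames: the second is shifted 5 units along x and yawed by 90 degrees.
pose0 = np.eye(4)
pose1 = np.eye(4)
pose1[:3, :3] = [[0.0, -1.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0]]
pose1[:3, 3] = [5.0, 0.0, 0.0]

frames = [np.array([[1.0, 0.0, 0.0]]), np.array([[1.0, 0.0, 0.0]])]
scene = fuse_static_scene(frames, [pose0, pose1])  # all points in the world frame
local1 = to_frame(scene, pose1)                    # the fused scene as seen from frame 1
```

Cropping `local1` to the point cloud range would then give the dense static points for that frame.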

Acknowledgement

Many thanks to these excellent projects:

Related Projects:

Bibtex

If this work is helpful for your research, please consider citing the following BibTeX entry.

@article{wei2023surroundocc, 
      title={SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving}, 
      author={Yi Wei and Linqing Zhao and Wenzhao Zheng and Zheng Zhu and Jie Zhou and Jiwen Lu},
      journal={arXiv preprint arXiv:2303.09551},
      year={2023}
}