
[ICCV2023] Official Implementation of "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation"

Primary language: Python. License: Apache-2.0.

UniTR: The First Unified Multi-modal Transformer Backbone for 3D Perception

This repo is the official implementation of the ICCV 2023 paper UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation, as well as its follow-ups. UniTR achieves state-of-the-art performance on the nuScenes dataset with a truly unified, weight-sharing multi-modal (e.g., cameras and LiDARs) backbone. UniTR is built upon the DSVT codebase. We have made every effort to keep the codebase clean, concise, easily readable, and state-of-the-art, with only minimal dependencies.

UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation

Haiyang Wang*, Hao Tang*, Shaoshuai Shi $^\dagger$, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang $^\dagger$

Contact: Haiyang Wang (wanghaiyang6@stu.pku.edu.cn), Hao Tang (tanghao@stu.pku.edu.cn), Shaoshuai Shi (shaoshuaics@gmail.com)

🚀 Gratitude to Tang Hao for extensive code refactoring and noteworthy contributions to open-source initiatives. His invaluable efforts were pivotal in ensuring the seamless completion of UniTR.

🔥 👀 Honestly, the partition in UniTR is slow and takes about 40% of the total time. This cost can be optimized to zero with better strategies or some engineering effort, so there is still huge room for speed optimization. We are not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part does not scale with model size, which makes it friendly for larger models.

📘 I am going to share my understanding of, and future plans for, a general 3D perception foundation model without reservation. Please refer to 🔥 Potential Research 🔥. If you find it useful or inspiring for your research, feel free to join me in building this blueprint.

Interpretive Articles: [CVer] [自动驾驶之心] [ReadPaper] [知乎] [CSDN] [TechBeat (将门创投)]

News

  • [24-08-12] 🔥 GiT was accepted by ECCV 2024 as an oral presentation. We hope you enjoy the success of the plain transformer family.
  • [24-07-01] 🔥 Our GiT was accepted by ECCV 2024. If you find it helpful, please give it a star. 🤗
  • [24-03-15] 🔥 GiT, the first successful general vision model using only a ViT, is released. In line with Potential Research, we attempted to address problems with the general model on the vision side. Combining UniTR and GiT to construct an LLM-like unified model suitable for autonomous driving scenarios is an intriguing direction.
  • [23-09-21] 🚀 Code for nuScenes is released.
  • [23-08-16] 🏆 SOTA: our single multi-modal UniTR outperforms all other non-TTA approaches on the nuScenes detection benchmark (Aug 2023), with an NDS of 74.5.
  • [23-08-16] 🏆 SOTA performance on multi-modal 3D object detection and BEV map segmentation on the nuScenes validation set.
  • [23-08-15] 👀 UniTR is released on arXiv.
  • [23-07-13] 🔥 UniTR is accepted at ICCV 2023.

Overview

TODO

  • Release the arXiv version.
  • SOTA performance of multi-modal 3D object detection (Nuscenes) and BEV Map Segmentation (Nuscenes).
  • Clean up and release the code of NuScenes.
  • Merge UniTR to OpenPCDet.

Introduction

Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overheads and inefficient collaboration between different sensor data.

In this paper, we present an efficient multi-modal backbone for outdoor 3D perception, which processes a variety of modalities with unified modeling and shared parameters. It is a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state of the art on the nuScenes benchmark, achieving +1.1 NDS for 3D object detection and +12.0 mIoU for BEV map segmentation with lower inference latency.

Main results

3D Object Detection (on NuScenes validation)

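| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE | ckpt | Log |
|---|---|---|---|---|---|---|---|---|---|
| UniTR | 73.0 | 70.1 | 26.3 | 24.7 | 26.8 | 24.6 | 17.9 | ckpt | Log |
| UniTR+LSS | 73.3 | 70.5 | 26.0 | 24.4 | 26.8 | 24.8 | 18.7 | ckpt | Log |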
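
For readers unfamiliar with the metric, the NDS column can be reproduced from the other columns of the table. The sketch below uses the nuScenes definition, NDS = (5 · mAP + Σ (1 − min(1, mTP))) / 10 over the five true-positive errors, and assumes the table reports every value multiplied by 100. The function name `nds` is illustrative and not part of the UniTR code.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors.

    All inputs are fractions (e.g. 0.701), not percentages.
    """
    # Each TP error is clipped to 1 and turned into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    # mAP has weight 5, each of the five TP scores has weight 1.
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# UniTR validation row: mAP, then mATE, mASE, mAOE, mAVE, mAAE.
unitr_val = nds(0.701, [0.263, 0.247, 0.268, 0.246, 0.179])
print(f"{100 * unitr_val:.1f}")  # prints 73.0
```

Because the table entries are rounded, recomputing other rows this way may differ from the reported NDS by about 0.1.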

3D Object Detection (on NuScenes test)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|
| UniTR | 74.1 | 70.5 | 24.4 | 23.3 | 25.7 | 24.1 | 13.0 |
| UniTR+LSS | 74.5 | 70.9 | 24.1 | 22.9 | 25.6 | 24.0 | 13.1 |

BEV Map Segmentation (on NuScenes validation)

| Model | mIoU | Drivable | Ped.Cross. | Walkway | StopLine | Carpark | Divider | ckpt | Log |
|---|---|---|---|---|---|---|---|---|---|
| UniTR | 73.2 | 90.4 | 73.1 | 78.2 | 66.6 | 67.3 | 63.8 | ckpt | Log |
| UniTR+LSS | 74.7 | 90.7 | 74.0 | 79.3 | 68.2 | 72.9 | 64.2 | ckpt | Log |

What's new here?

🔥 Beats previous SOTAs of outdoor multi-modal 3D Object Detection and BEV Segmentation

Our approach has achieved the best performance on multiple tasks (e.g., 3D Object Detection and BEV Map Segmentation), and it is highly versatile, requiring only the replacement of the backbone.

[Figure: 3D Object Detection]
[Figure: BEV Map Segmentation]

🔥 Weight-Sharing among all modalities

We introduce a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps.

🔥 Prerequisite for 3D vision foundation models

A weight-shared unified multimodal encoder is a prerequisite for foundation models, especially in the context of 3D perception, unifying information from both images and LiDAR data. This is the first truly multimodal fusion backbone, seamlessly connecting to any 3D detection head.

Quick Start

Installation

conda create -n unitr python=3.8
conda activate unitr
# Install torch; we have only tested with PyTorch 1.10
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/torch_stable.html

git clone https://github.com/Haiyang-W/UniTR
cd UniTR

# Install extra dependencies
pip install -r requirements.txt

# Install nuscenes-devkit
pip install nuscenes-devkit==1.0.5

# Install UniTR in development mode
python setup.py develop
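
After installation, a quick check that the pinned versions are the ones actually installed can save debugging time later. The following is a minimal sketch using only the standard library (`importlib.metadata`, available from Python 3.8). The helper `check_versions` and the `EXPECTED` table are illustrative names, not part of UniTR, and the version strings are simply the ones pinned in the commands above.

```python
from importlib import metadata

# Versions pinned in the installation commands above.
EXPECTED = {
    "torch": "1.10.1",
    "torchvision": "0.11.2",
    "nuscenes-devkit": "1.0.5",
}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu113" when comparing.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        print(f"{name}: installed={installed} ok={ok}")
```

Run it inside the `unitr` environment. A package that is missing is reported with `installed=None`.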

Dataset Preparation

OpenPCDet
├── data
│   ├── nuscenes
│   │   │── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   │── samples
│   │   │   │── sweeps
│   │   │   │── maps
│   │   │   │── v1.0-trainval  
├── pcdet
├── tools
  • (Optional) To install the map expansion for the BEV map segmentation task, download the files from Map expansion (Map expansion pack (v1.3)) and copy them into your nuScenes maps folder, e.g. /data/nuscenes/v1.0-trainval/maps, as follows:
OpenPCDet
├── maps
│   ├── ......
│   ├── boston-seaport.json
│   ├── singapore-onenorth.json
│   ├── singapore-queenstown.json
│   ├── singapore-hollandvillage.json
  • Generate the data infos by running the following command (it may take several hours):
# Create dataset info file, lidar and image gt database
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
    --cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
    --version v1.0-trainval \
    --with_cam \
    --with_cam_gt \
    # --share_memory # optional: use shared memory for lidar and image gt sampling (about 24G+143G or 12G+72G)
# Shared memory greatly improves training speed, but needs about 150G or 75G of extra cache memory.
# NOTE: all the experiments used shared memory. Shared memory does not affect model performance.
  • The format of the generated data is as follows:
OpenPCDet
├── data
│   ├── nuscenes
│   │   │── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   │── samples
│   │   │   │── sweeps
│   │   │   │── maps
│   │   │   │── v1.0-trainval  
│   │   │   │── img_gt_database_10sweeps_withvelo
│   │   │   │── gt_database_10sweeps_withvelo
│   │   │   │── nuscenes_10sweeps_withvelo_lidar.npy (optional) # if shared memory is enabled
│   │   │   │── nuscenes_10sweeps_withvelo_img.npy (optional) # if shared memory is enabled
│   │   │   │── nuscenes_infos_10sweeps_train.pkl  
│   │   │   │── nuscenes_infos_10sweeps_val.pkl
│   │   │   │── nuscenes_dbinfos_10sweeps_withvelo.pkl
├── pcdet
├── tools
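
Since generating the data infos can take several hours, it may be worth confirming that every expected entry exists before starting training. The sketch below checks the layout shown above. The entry names are copied from that tree, the two `.npy` files are left out because they are optional, and the helper `missing_entries` is an illustrative name rather than a UniTR utility.

```python
from pathlib import Path

# Entries taken from the directory tree above (optional *.npy files omitted).
REQUIRED = [
    "samples",
    "sweeps",
    "maps",
    "v1.0-trainval",
    "img_gt_database_10sweeps_withvelo",
    "gt_database_10sweeps_withvelo",
    "nuscenes_infos_10sweeps_train.pkl",
    "nuscenes_infos_10sweeps_val.pkl",
    "nuscenes_dbinfos_10sweeps_withvelo.pkl",
]


def missing_entries(root, required=REQUIRED):
    """Return the required files or folders that do not exist under root."""
    root = Path(root)
    return [name for name in required if not (root / name).exists()]


if __name__ == "__main__":
    # Path relative to the repository root; adjust for v1.0-mini.
    missing = missing_entries("data/nuscenes/v1.0-trainval")
    print("all expected entries found" if not missing else f"missing: {missing}")
```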

Training

Please download the pretrained checkpoint from unitr_pretrain.pth and copy the file into the root folder, e.g., UniTR/unitr_pretrain.pth. This file contains the weights obtained by pretraining DSVT on the ImageNet and nuImages datasets.

3D object detection:

# multi-gpu training
## normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000

## add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000

BEV Map Segmentation:

# multi-gpu training
# note that we don't use image pretrain in BEV Map Segmentation
## normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --sync_bn --eval_map --logger_iter_interval 1000

## add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --sync_bn --eval_map --logger_iter_interval 1000

Testing

3D object detection:

# multi-gpu testing
## normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --ckpt <CHECKPOINT_FILE>

## add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --ckpt <CHECKPOINT_FILE>

BEV Map Segmentation

# multi-gpu testing
## normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --ckpt <CHECKPOINT_FILE> --eval_map

## add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --ckpt <CHECKPOINT_FILE> --eval_map
# NOTE: evaluation results are not logged in *.log; they are only printed in the terminal

Cache Testing

  • 🔥 If the camera and LiDAR parameters of your dataset remain constant, using our cache mode will not affect performance. You can even cache all mapping calculations during training, which can significantly accelerate training.
  • Each sample in nuScenes has some variation in camera parameters, so we disable cache mode during normal inference to ensure accurate results. However, thanks to the robustness of our mapping, performance drops only slightly (around 0.4 NDS) even with camera parameter variations like those in nuScenes.
  • Cache mode currently supports only a batch size of 1 per GPU (8 GPUs x 1 = 8).
  • In our observation, backbone caching reduces inference latency by about 40%.
# Only for 3D Object Detection
## normal
### cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8

## add LSS
### cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+LSS_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8

## add LSS
### cache the mapping computation of multi-modal backbone and LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+LSS_cache_plus.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8
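
The idea behind caching a mapping computation can be illustrated with a small, self-contained sketch. This is not UniTR's implementation: `MappingCache` and `toy_mapping` are hypothetical names, and the "mapping" is a stand-in for the real camera-LiDAR mapping. The sketch recomputes whenever the calibration changes, which is a stricter policy than reusing one cached mapping for every sample.

```python
class MappingCache:
    """Reuse an expensive mapping while the calibration stays the same."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn  # hypothetical: builds the mapping
        self._key = None
        self._value = None
        self.num_computed = 0         # how often the mapping was rebuilt

    def get(self, calibration):
        key = tuple(calibration)      # calibration as a flat sequence of numbers
        if key != self._key:          # recompute only when the parameters change
            self._value = self.compute_fn(calibration)
            self._key = key
            self.num_computed += 1
        return self._value


def toy_mapping(calibration):
    # Stand-in for the real mapping computation; purely illustrative.
    return [2.0 * c for c in calibration]


cache = MappingCache(toy_mapping)
for _ in range(3):
    cache.get([1.0, 0.5, 0.0])        # constant calibration: computed once
cache.get([1.0, 0.5, 0.1])            # changed calibration: computed again
print(cache.num_computed)             # prints 2
```

With constant sensor parameters the mapping is built once and reused, which is why caching does not change the results in that setting.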

Performance of cache testing on NuScenes validation (some variations in camera parameters)

| Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
|---|---|---|---|---|---|---|---|
| UniTR (Cache Backbone) | 72.6 (-0.4) | 69.4 (-0.7) | 26.9 | 24.8 | 26.3 | 24.6 | 18.2 |
| UniTR+LSS (Cache Backbone) | 73.1 (-0.2) | 70.2 (-0.3) | 25.8 | 24.4 | 26.0 | 25.3 | 18.2 |
| UniTR+LSS (Cache Backbone and LSS) | 72.6 (-0.7) | 69.3 (-1.2) | 26.7 | 24.3 | 25.9 | 25.3 | 18.2 |

Potential Research

  • Infrastructure of 3D Vision Foundation Model. An efficient network design is crucial for large models, and a reliable model structure allows the development of large models to advance. An open question is how to make a general multimodal backbone more efficient and easier to deploy. Honestly, the partition in UniTR is slow and takes about 40% of the total time. This cost can be optimized to zero with better partition strategies or some engineering effort, so there is still huge room for speed optimization. We are not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part does not scale with model size, which makes it friendly for larger models.
  • Multi-Modal Self-supervised Learning based on image-LiDAR pairs and UniTR. Please refer to the following figure. The images and point clouds both describe the same 3D scene, and they complement each other through highly informative correspondences. This allows for unsupervised learning of a more generic scene representation with shared parameters.
  • Single-Modal Pretraining. Our model is almost the same as ViT (except for some position embedding strategies). If we adjust the position embedding appropriately, DSVT and UniTR can directly load the pretrained parameters of ViT. This is beneficial for better integration with the 2D community.
  • Unified Modeling of 3D Vision. Please refer to the following figure.

Possible Issues

  • If you encounter gradients that become NaN during fp16 training, note that fp16 training is not supported.
  • If you cannot find a solution, search the open and closed issues on our GitHub issues page here.
  • By default, we enable the torch checkpoint option here during training, which saves about 50% of CUDA memory.
  • Samples in nuScenes have some variation in camera parameters, so during training every sample recalculates the camera-LiDAR mapping, which significantly slows down training (~40%). If the extrinsic parameters in your dataset are consistent, I recommend caching this computation during training.
  • If you still have no luck, open a new issue on our GitHub. Our turnaround is usually a couple of days.

Citation

Please consider citing our work as follows if it is helpful.

@inproceedings{wang2023unitr,
    title={UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation},
    author={Wang, Haiyang and Tang, Hao and Shi, Shaoshuai and Li, Aoxue and Li, Zhenguo and Schiele, Bernt and Wang, Liwei},
    booktitle={ICCV},
    year={2023}
}

Acknowledgments

UniTR uses code from a few open source repositories. Without the efforts of these folks (and their willingness to release their implementations), UniTR would not be possible. We thank these authors for their efforts!