In this repo, we provide a ROS wrapper for lightweight yet powerful 3D object detection with a TensorRT inference backend for real-time robotic applications.
- It is effective and efficient, achieving a 5 ms runtime and 85% 3D Car mAP@R40.
- We chose IA-SSD as the baseline because of its high efficiency. On top of it, HAVSampler and GridBallQuery are adopted, which are up to 1000x faster than FPS and the original BallQuery, respectively.
- We implement TensorRT plugins for NMS post-processing and for common operators of point-based point cloud detectors, e.g., sampling, grouping, and gather.
- [2022/04/14]: This repository implements GridBallQuery with a computational complexity of $\mathcal{O}(NK^3)$ instead of the $\mathcal{O}(NM)$ of BallQuery (a host-side sketch of the idea is given below).
- [2022/04/08]: Support INT8 quantization and Profiler.
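To illustrate the idea behind GridBallQuery, here is a minimal host-side C++ sketch (the actual implementation is a CUDA kernel; the function names and flat hash-grid layout are illustrative, not the repo's API). Points are bucketed into voxels whose edge length equals the query radius, so each query only scans the $K^3 = 27$ neighboring cells instead of all $M$ points:

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point { float x, y, z; };

// Pack a (shifted, non-negative) voxel coordinate into a single 64-bit key.
static int64_t voxelKey(int ix, int iy, int iz) {
    const int64_t offset = 1 << 19;  // assumes |voxel index| < 2^19
    return ((ix + offset) << 40) | ((iy + offset) << 20) | (iz + offset);
}

// Bucket all points into voxels whose edge length equals the search radius.
static std::unordered_map<int64_t, std::vector<int>>
buildGrid(const std::vector<Point>& pts, float radius) {
    std::unordered_map<int64_t, std::vector<int>> grid;
    for (int i = 0; i < static_cast<int>(pts.size()); ++i) {
        grid[voxelKey(static_cast<int>(std::floor(pts[i].x / radius)),
                      static_cast<int>(std::floor(pts[i].y / radius)),
                      static_cast<int>(std::floor(pts[i].z / radius)))].push_back(i);
    }
    return grid;
}

// GridBallQuery: for each of the N query centers, scan only the K^3 = 27
// surrounding cells instead of all M points, i.e. O(N*K^3) instead of O(N*M).
std::vector<std::vector<int>> gridBallQuery(const std::vector<Point>& pts,
                                            const std::vector<Point>& queries,
                                            float radius, int maxSamples) {
    const auto grid = buildGrid(pts, radius);
    const float r2 = radius * radius;
    std::vector<std::vector<int>> neighbors(queries.size());
    for (size_t q = 0; q < queries.size(); ++q) {
        const int cx = static_cast<int>(std::floor(queries[q].x / radius));
        const int cy = static_cast<int>(std::floor(queries[q].y / radius));
        const int cz = static_cast<int>(std::floor(queries[q].z / radius));
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    const auto it = grid.find(voxelKey(cx + dx, cy + dy, cz + dz));
                    if (it == grid.end()) continue;
                    for (int idx : it->second) {
                        const float ddx = pts[idx].x - queries[q].x;
                        const float ddy = pts[idx].y - queries[q].y;
                        const float ddz = pts[idx].z - queries[q].z;
                        if (ddx * ddx + ddy * ddy + ddz * ddz <= r2 &&
                            static_cast<int>(neighbors[q].size()) < maxSamples) {
                            neighbors[q].push_back(idx);
                        }
                    }
                }
    }
    return neighbors;
}
```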
We tested on the following platform:
- Ubuntu 18.04 with an RTX 2080 Ti GPU
- Python 3.7
- PyTorch 1.12
- CUDA 11.0
- cuDNN 8.4
- TensorRT 8.4.0
Follow the official guides to install the above dependencies first, and then build this package:
```bash
export CUDNN_DIR=/path/to/cudnn/root
export TENSORRT_DIR=/path/to/tensorrt/root
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DTRT_QUANTIZE=FP16 -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
make -j$(nproc)
```
Alternatively, build it as a normal ROS package.
We evaluate the exported model with TensorRT on the KITTI val set and report AP_3D@R11/R40 as follows:
|      | Car | Pedestrian | Cyclist | Runtime |
|------|-----|------------|---------|---------|
| FP32 | 83.8752 / 84.9749 | 53.9177 / 53.1046 | 67.2500 / 67.1609 | 10 ms |
| FP16 | 80.2896 / 80.8535 | 53.0247 / 51.4732 | 67.8503 / 68.3627 | 8 ms |
| INT8 | 77.7286 / 79.3178 | 52.2956 / 50.7517 | 68.3595 / 68.3880 | 9 ms |
Unexpectedly, the runtime in INT8 mode is higher than in FP16 mode. This may be because we did not implement INT8 formats for the custom plugin layers, and the point-based model contains little large-block computation that benefits from INT8.
We also profile the model at different precisions; read this for details.
|      | Runtime |
|------|---------|
| FP32 | 6 ms    |
| FP16 | 5 ms    |
| INT8 |         |
The node subscribes to `sensor_msgs::PointCloud2` messages on `/points` and publishes `visualization_msgs::MarkerArray` messages on `/objects`.
```bash
./devel/lib/point_detection/point_detector
```
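For reference, here is a minimal sketch of that topic wiring in ROS1 C++; the class name and the placeholder callback body are illustrative, while the real node runs the TensorRT engine inside the callback and fills one marker per detected box:

```cpp
#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <visualization_msgs/MarkerArray.h>

// Minimal sketch of the node's I/O: subscribe to /points, run the detector
// (omitted here), and publish one Marker per detected box on /objects.
class PointDetectorNode {
public:
    explicit PointDetectorNode(ros::NodeHandle& nh) {
        sub_ = nh.subscribe("/points", 1, &PointDetectorNode::cloudCallback, this);
        pub_ = nh.advertise<visualization_msgs::MarkerArray>("/objects", 1);
    }

private:
    void cloudCallback(const sensor_msgs::PointCloud2ConstPtr& cloud) {
        // ... copy the cloud to the GPU and run the TensorRT engine here ...
        visualization_msgs::MarkerArray markers;
        visualization_msgs::Marker box;   // one marker per detected object
        box.header = cloud->header;       // keep the input frame and stamp
        box.type = visualization_msgs::Marker::CUBE;
        box.action = visualization_msgs::Marker::ADD;
        // box.pose / box.scale would be filled from the detector output.
        markers.markers.push_back(box);
        pub_.publish(markers);
    }

    ros::Subscriber sub_;
    ros::Publisher pub_;
};

int main(int argc, char** argv) {
    ros::init(argc, argv, "point_detector");
    ros::NodeHandle nh;
    PointDetectorNode node(nh);
    ros::spin();
    return 0;
}
```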
We also offer a utility script to publish point clouds from `.bin` files:
```bash
python src/pcvt.py -s bin -d topic -t /points -p /home/nrsl/Downloads/velodyne_points/data
```
When building the engine in INT8 mode, it throws a `cuda configuration error` during calibration. Therefore, only the FP32 and FP16 modes can be used.
Feel free to contact us if the source code of the PyTorch models is required.
- Consider using CUDA Graphs to reduce the latency introduced by launching too many kernels (see the sketch after this list).
- Use dynamic parallelism to avoid the CPU-side loop in HAVSampling.
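For the first item, a minimal sketch assuming inference runs through `IExecutionContext::enqueueV2` on a dedicated stream: the per-frame kernel launches are captured once into a CUDA graph and then replayed with a single `cudaGraphLaunch` (the function and variable names are illustrative). In practice, a warm-up `enqueueV2` call before capture may be needed so that TensorRT's lazy allocations do not happen while the stream is being captured.

```cpp
#include <cuda_runtime.h>
#include <NvInfer.h>

// Capture one inference pass into a CUDA graph, then replay it.
// `context` is the nvinfer1::IExecutionContext and `bindings` the device
// buffers already set up by the node (names here are illustrative).
void buildAndLaunchGraph(nvinfer1::IExecutionContext* context,
                         void** bindings, cudaStream_t stream) {
    cudaGraph_t graph = nullptr;
    cudaGraphExec_t graphExec = nullptr;

    // 1. Record every kernel launched by enqueueV2 instead of executing it.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    context->enqueueV2(bindings, stream, nullptr);
    cudaStreamEndCapture(stream, &graph);

    // 2. Instantiate the captured graph once (this is the expensive step).
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    // 3. Replay: a single launch submits the whole kernel sequence,
    //    amortizing the per-kernel launch overhead.
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
```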