ALIKED TensorRT

This is a TensorRT implementation of ALIKED.

Conversion approach includes two "tricks":

  • Adds a custom DeformConv ONNX conversion due to an opset version mismatch: torch.onnx currently supports opset 18 (see here), while DeformConv was added in opset 19. The custom DeformConv was adopted from here.
  • Instead of using get_patches from custom_ops, get_patches was reimplemented in PyTorch using available operations (see the sketch below).
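
As an illustration of the second trick, below is a minimal sketch of how patch extraction around keypoints can be expressed with plain PyTorch indexing. It is not the repository's actual implementation; the function name, patch radius, and tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def get_patches_torch(feature_map, keypoints, radius=1):
    # feature_map: (C, H, W); keypoints: (N, 2) integer (x, y) pixel coordinates.
    # Returns patches of shape (N, C, 2*radius+1, 2*radius+1).
    padded = F.pad(feature_map, (radius, radius, radius, radius), mode="replicate")
    offsets = torch.arange(-radius, radius + 1, device=feature_map.device)
    dy, dx = torch.meshgrid(offsets, offsets, indexing="ij")
    # Shift keypoints into the padded coordinate frame and build (N, k, k) index grids.
    xs = keypoints[:, 0].long()[:, None, None] + dx[None] + radius
    ys = keypoints[:, 1].long()[:, None, None] + dy[None] + radius
    # Advanced indexing gathers one (k, k) window per keypoint for every channel.
    patches = padded[:, ys, xs]               # (C, N, k, k)
    return patches.permute(1, 0, 2, 3).contiguous()

Because a version like this only uses standard operations (padding, arange, meshgrid, indexing), it can be exported to ONNX without a custom op.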

NEW! See fp16 (AMP) results below!

Top-K or Score Threshold

The model has two options when filtering keypoints: either you get the top-k keypoints, according to score, or you get all keypoints with a score larger than a threshold. Since the top-k approach gives a fixed number of outputs, it is the one selected for TensorRT conversion: thresholding produces an unknown number of outputs, which is tricky for TensorRT conversion.

Additionally, after fetching the top-k keypoints, the user can always manually reject those with a score lower than a threshold, as in the snippet below.
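
A minimal sketch of that post-filtering step; the variable names keypoints, descriptors, scores, and score_threshold are assumptions, not this repository's API:

# Hypothetical post-filtering after top-k: drop keypoints below a score threshold.
keep = scores > score_threshold        # boolean mask over the k keypoints
keypoints = keypoints[keep]
descriptors = descriptors[keep]
scores = scores[keep]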

Tools

For PyTorch and TensorRT:

  • container: nvcr.io/nvidia/pytorch:23.12-py3
  • PyTorch: 2.2.0a0+81ea7a4
  • TensorRT: 8.6.1
  • Torch-TensorRT: 2.2.0a0
  • ONNX: 1.15.0rc2
  • GPUs: GTX 1660 TI (nvidia-driver: 545.29.06) and RTX 2070 (nvidia-driver: 530.41.03)

For PyTorch torch.compile() (a different container because torch.compile() is only in the PyTorch nightly build; see the sketch after this list):

  • container: nvcr.io/nvidia/cuda:12.1.1-devel-ubuntu22.04
  • PyTorch: 2.2.1+cu118
  • GPUs: GTX 1660 TI (nvidia-driver: 545.29.06)
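
For reference, enabling torch.compile() is typically a one-liner; model and image below are assumed names, not identifiers from this repository:

import torch

# Hypothetical: wrap the PyTorch model so it gets compiled on the first call.
compiled_model = torch.compile(model)
outputs = compiled_model(image)   # first call compiles, later calls run the optimized graph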

Timings

Timings are currently done using data from assets. Before measuring inference time, the models go through a warm-up: a couple of inferences are run in order to fully initialize memory on the GPU. This initialization takes time, so the first inferences take longer than usual. Hence, after the warm-up, there is no need to reject the first few timing measurements.
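
A rough sketch of such a measurement loop (an illustration under assumed names, not the measure_timings.py script itself):

import time
import torch

def time_inference(model, image, warmup=10, iters=100):
    with torch.inference_mode():
        for _ in range(warmup):            # warm-up: initialize GPU memory and kernels
            model(image)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(image)
        torch.cuda.synchronize()           # wait for all kernels before stopping the clock
    return (time.perf_counter() - start) / iters * 1000.0   # mean inference time in ms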

Rows:

  • model - ALIKED model type
  • K - top_k
  • image - Image sizes of 640x480 and 1241x376 reflect the TUM and KITTI image sizes from the assets dir.

Columns:

  • TRT - TensorRT
  • PyT - PyTorch
  • PyT.c - PyTorch with torch.compile()
  • ms - Mean inference time in milliseconds.
  • MiB - GPU memory consumption as shown by nvidia-smi.
|                                      | GTX 1660 Ti Mobile TRT (ms, MiB) | GTX 1660 Ti Mobile PyT (ms, MiB) | GTX 1660 Ti Mobile PyT.c (ms, MiB) | RTX 2070 TRT (ms, MiB) |
| ------------------------------------ | -------------------------------- | -------------------------------- | ---------------------------------- | ---------------------- |
| model=t16, K=1000, image=640x480     | 11.62, 356                       | 15.13, 866                       | 10.52, 740                         | 9.73, 404              |
| model=t16, K=2000, image=640x480     | 13.55, 364                       | 16.32, 858                       | 10.75, 792                         | 11.04, 404             |
| model=t16, K=1000, image=1241x376    | 15.88, 468                       | 19.20, 1222                      | 14.23, 1110                        | 13.33, 532             |
| model=t16, K=2000, image=1241x376    | 18.76, 474                       | 20.30, 1240                      | 14.66, 1114                        | 15.01, 526             |
| model=n16rot, K=1000, image=640x480  | 17.66, 558                       | 20.99, 1490                      | 14.81, 1240                        | 14.42, 600             |
| model=n16rot, K=2000, image=640x480  | 21.72, 552                       | 24.58, 1514                      | 15.00, 1314                        | 17.18, 604             |
| model=n16rot, K=1000, image=1241x376 | 25.42, 788                       | 27.28, 2204                      | 21.39, 1884                        | 21.86, 818             |
| model=n16rot, K=2000, image=1241x376 | 29.53, 782                       | 30.48, 2228                      | 21.78, 1932                        | 23.53, 824             |

Timings for FP16

The following table shows results for PyTorch AMP (Automatic Mixed Precision) and TensorRT (also utilizing half precision) inference.

By utilising AMP, the model uses less memory because some weights are converted to the half-precision data type (fp16), and inference is potentially faster.
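
As a rough illustration, PyTorch AMP inference typically looks like the sketch below; model and image are assumed names, not identifiers from this repository:

import torch

# Hypothetical AMP inference: eligible ops run in fp16 under autocast,
# precision-sensitive ops stay in fp32.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    keypoints, descriptors, scores = model(image.cuda())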

FP16 slower and GTX-1660-Ti Tensor Cores?

Note: Measurements are done on a GTX 1660 Ti, which doesn't have Tensor Cores; Tensor Cores accelerate half-precision arithmetic and make such inference faster.

Hence, PyTorch AMP execution has a smaller memory footprint than the original PyTorch model, but its inference is slower.

On the other hand, TensorRT achieves faster inference and smaller memory footprint even on GTX 1660 Ti.

Data Precision Loss?

Even though AMP tries to reduce the precision loss, you still get different outputs between original and AMP model.

After visualizing the outputs (keypoints) and comparing the outputs of both models (original and AMP), it can be seen that differences are negligible.

The interesting part when comparing the outputs is the descriptors, since they are multidimensional vectors. When compared element-wise, the results are not ideal: there are differences. But when you inspect vector similarity via the dot product, all the dot products are ~1 (the descriptors here are normalized).

That means the vector directions are preserved and those descriptors are still useful.
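
A minimal sketch of that comparison, assuming matched descriptor matrices desc_fp32 and desc_fp16 of shape (N, D) (hypothetical names):

import torch

# Element-wise differences exist, but for L2-normalized descriptors the
# per-row dot product equals the cosine similarity and stays close to 1.
max_abs_diff = (desc_fp32 - desc_fp16).abs().max()
cosine = (desc_fp32 * desc_fp16).sum(dim=1)   # ~1.0 for every matched pair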

But it would be good to check this on downstream tasks like homography or pose estimation (see TODOs).

Check compare_fp32_fp16.py for output visualization and comparison.

Sort the outputs before comparing

It's important to sort the data before comparing because, due to half-precision rounding, the models output different keypoints: e.g., in the top_k=1000 config, the AMP model misses only ~3 of the keypoints that the original model computed.

Hence, by computing the mutual nearest neighbours of the keypoints (by their x, y coordinates), we find the keypoints common to both outputs.

Check the find_mutual_closest_keypoints() method in compare_fp32_fp16.py.
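
For reference, mutual-nearest-neighbour matching in (x, y) space can be sketched as below; this is an illustration, not the find_mutual_closest_keypoints() implementation itself:

import torch

def mutual_nearest_keypoints(kpts_a, kpts_b):
    # kpts_a: (N, 2), kpts_b: (M, 2) keypoint coordinates.
    dist = torch.cdist(kpts_a, kpts_b)     # (N, M) pairwise distances
    nn_ab = dist.argmin(dim=1)             # closest b for each a
    nn_ba = dist.argmin(dim=0)             # closest a for each b
    idx_a = torch.arange(kpts_a.shape[0], device=kpts_a.device)
    mutual = nn_ba[nn_ab] == idx_a         # keep pairs that agree in both directions
    return idx_a[mutual], nn_ab[mutual]    # indices of the common keypoints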

Timings for FP16

|                                      | GTX 1660 Ti Mobile TRT.AMP (ms, MiB) | GTX 1660 Ti Mobile PyT (ms, MiB) | GTX 1660 Ti Mobile PyT.AMP (ms, MiB) |
| ------------------------------------ | ------------------------------------ | -------------------------------- | ------------------------------------ |
| model=t16, K=1000, image=640x480     | 9.17, 280                            | 15.13, 866                       | 21.93, 544                            |
| model=t16, K=2000, image=640x480     | 10.84, 280                           | 16.32, 858                       | 25.07, 546                            |
| model=t16, K=1000, image=1241x376    | TBM                                  | 19.20, 1222                      | 28.03, 796                            |
| model=t16, K=2000, image=1241x376    | TBM                                  | 20.30, 1240                      | 30.96, 812                            |
| model=n16rot, K=1000, image=640x480  | TBM                                  | 20.99, 1490                      | 35.85, 1020                           |
| model=n16rot, K=2000, image=640x480  | TBM                                  | 24.58, 1514                      | 44.04, 1052                           |
| model=n16rot, K=1000, image=1241x376 | TBM                                  | 27.28, 2204                      | 47.46, 1536                           |
| model=n16rot, K=2000, image=1241x376 | TBM                                  | 30.48, 2228                      | 54.86, 1542                           |

Convert to ONNX and TensorRT

To convert the model from PyTorch to ONNX:

$ python convert_pytorch_to_onnx.py \
    assets/tum \
    --model aliked-n16rot \
    --model_output converted_model/aliked-n16rot-top1k-tum.onnx \
    --opset_version 17 \
    --verbose \
    --top_k 1000

To convert the model from ONNX to TensorRT:

$ python convert_onnx_to_trt.py \
    --model_onnx_path converted_model/aliked-n16rot-top1k-tum.onnx \
    --model_trt_path converted_model/aliked-n16rot-top1k-tum.trt

Measure

To measure timings, use the measure_timings.py script, which accepts the same arguments as demo_pair.py.

TODO

  • Refactor code for easier speed measuring.
  • Add warm-up.
  • Add auto-fetching of gpu memory consumption.
  • Measure speed and memory in:
    • tensorrt
    • original pytorch
    • pytorch.compile
    • pytorch.amp
    • tensorrt + pytorch.amp
    • pytorch-tensorrt
    • onnx-gpu
  • Sort outputs before comparing, during conversion in onnx.
  • Investigate the speed of the custom_ops get_patches vs. the PyTorch reimplementation.
  • Add more data for measuring.
  • Add more measurements.
  • Add C++ impl.
  • Use NVIDIA Triton inference?
  • Check AMP data on down-stream tasks (e.g. homography, pose, ...)