koide3/fast_gicp

Unexpected Performance: Single-Threaded Faster than Multi-Threaded in Point Cloud Alignment


Hello there! Thanks for your great work!
I ran into an issue when I deployed it on my PC. Can anyone help me take a look? Thanks!

Description

I have observed unexpected performance behavior while using fast_gicp_mt. Specifically, the single-threaded versions of certain point cloud alignment algorithms, such as GICP and NDT, outperform their multi-threaded counterparts. This was observed while aligning two point clouds of 17047 and 17334 points.
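
For reference, this is roughly how I drive the multi-threaded variant (a minimal sketch following the usage in the repo README; I did not call setNumThreads(), and if I read the code correctly the thread count then defaults to every available hardware thread):

```cpp
#include <iostream>
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <fast_gicp/gicp/fast_gicp.hpp>

int main() {
  pcl::PointCloud<pcl::PointXYZ>::Ptr target(new pcl::PointCloud<pcl::PointXYZ>);
  pcl::PointCloud<pcl::PointXYZ>::Ptr source(new pcl::PointCloud<pcl::PointXYZ>);
  pcl::io::loadPCDFile("251370668.pcd", *target);  // 17047 pts
  pcl::io::loadPCDFile("251371071.pcd", *source);  // 17334 pts

  // Multi-threaded GICP (fgicp_mt). Without an explicit setNumThreads()
  // call, the thread count appears to default to omp_get_max_threads(),
  // i.e. all 32 hardware threads on this machine.
  fast_gicp::FastGICP<pcl::PointXYZ, pcl::PointXYZ> gicp;
  gicp.setInputTarget(target);
  gicp.setInputSource(source);

  pcl::PointCloud<pcl::PointXYZ> aligned;
  gicp.align(aligned);
  std::cout << "fitness_score: " << gicp.getFitnessScore() << std::endl;
  return 0;
}
```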

Environment

The repo is deployed with Docker on WSL.
OS: Ubuntu 20.04 + ROS Noetic
GPU: RTX 4090 32GB
CPU: i9-13900KF
RAM: 32GB

Details

The execution times for the various algorithms were recorded, and the single-threaded implementations were consistently faster than the multi-threaded ones. Below are the results obtained:

$ rosrun fast_gicp gicp_align 251370668.pcd 251371071.pcd
target:17047[pts] source:17334[pts]
--- pcl_gicp ---
single:110.186[msec] 100times:11059.9[msec] fitness_score:0.204892
--- pcl_ndt ---
single:39.1375[msec] 100times:4043.5[msec] fitness_score:0.229616
--- fgicp_st ---
single:101.371[msec] 100times:9945.61[msec] 100times_reuse:6586.6[msec] fitness_score:0.204376
--- fgicp_mt ---
single:135.229[msec] 100times:12986.9[msec] 100times_reuse:11950.3[msec] fitness_score:0.204384
--- vgicp_st ---
single:85.6506[msec] 100times:7514.18[msec] 100times_reuse:4194.52[msec] fitness_score:0.205022
--- vgicp_mt ---
single:158.688[msec] 100times:16300.5[msec] 100times_reuse:15309.5[msec] fitness_score:0.205022
--- ndt_cuda (P2D) ---
single:17.4151[msec] 100times:1702.9[msec] 100times_reuse:1340.19[msec] fitness_score:0.197208
--- ndt_cuda (D2D) ---
single:13.5261[msec] 100times:1391.88[msec] 100times_reuse:1119.26[msec] fitness_score:0.199985
--- vgicp_cuda (parallel_kdtree) ---
single:37.8372[msec] 100times:3054.31[msec] 100times_reuse:1987.94[msec] fitness_score:0.205017
--- vgicp_cuda (gpu_bruteforce) ---
single:65.4749[msec] 100times:3064.62[msec] 100times_reuse:2966.4[msec] fitness_score:0.249594
--- vgicp_cuda (gpu_rbf_kernel) ---
single:13.1453[msec] 100times:1515.33[msec] 100times_reuse:1119.99[msec] fitness_score:0.204766

Expected Behavior:

Typically, one would expect the multi-threaded implementations to be at least as fast as the single-threaded ones, especially when dealing with larger datasets.

Hi everyone,

I wanted to share an update on the performance issue I was experiencing with the multi-threaded versions of point cloud alignment algorithms.

Initially, I was using the maximum thread count supported by my CPU (32 threads), but this setup actually resulted in slower performance than the single-threaded implementations.

However, when I reduced the number of threads to 8, the processing times of the multi-threaded versions improved dramatically and, as one would expect, dropped below those of the single-threaded versions. Here are the updated results:

$ rosrun fast_gicp gicp_align 251370668.pcd 251371071.pcd
target:17047[pts] source:17334[pts]
--- pcl_gicp ---
single:114.265[msec] 100times:11190.8[msec] fitness_score:0.204892
--- pcl_ndt ---
single:40.3903[msec] 100times:4108.75[msec] fitness_score:0.229616
--- fgicp_st ---
single:103.508[msec] 100times:10122.8[msec] 100times_reuse:6677.71[msec] fitness_score:0.204376
--- fgicp_mt ---
single:22.2643[msec] 100times:2076.86[msec] 100times_reuse:1322.39[msec] fitness_score:0.204384
--- vgicp_st ---
single:76.7637[msec] 100times:7601.88[msec] 100times_reuse:4227.26[msec] fitness_score:0.205022
--- vgicp_mt ---
single:16.8928[msec] 100times:1723.56[msec] 100times_reuse:964.225[msec] fitness_score:0.205022
--- ndt_cuda (P2D) ---
single:17.818[msec] 100times:1747.58[msec] 100times_reuse:1329.59[msec] fitness_score:0.197216
--- ndt_cuda (D2D) ---
single:13.9255[msec] 100times:1415.41[msec] 100times_reuse:1161.17[msec] fitness_score:0.199983
--- vgicp_cuda (parallel_kdtree) ---
single:36.8168[msec] 100times:2271.8[msec] 100times_reuse:1713.19[msec] fitness_score:0.205017
--- vgicp_cuda (gpu_bruteforce) ---
single:55.5222[msec] 100times:2822.75[msec] 100times_reuse:2615.85[msec] fitness_score:0.249594
--- vgicp_cuda (gpu_rbf_kernel) ---
single:14.8914[msec] 100times:1403.59[msec] 100times_reuse:941.221[msec] fitness_score:0.204766

It appears that using the maximum thread count was creating a bottleneck, possibly due to the overhead of context switching and resource contention. The i9-13900KF is also a hybrid CPU (8 performance cores plus 16 efficiency cores, for 32 hardware threads), so spawning 32 OpenMP threads schedules work onto the slower efficiency cores as well, and the WSL/Docker layer may add further scheduling overhead. Using a reduced thread count that better matches the CPU's performance cores and the workload's size seems to be the key to optimal performance.
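
For anyone hitting the same thing, here is a minimal sketch of how the thread count can be capped (assuming the setNumThreads() setter shown in the README and standard OpenMP; the value 8 matches the performance-core count on this CPU):

```cpp
#include <omp.h>
#include <pcl/point_types.h>
#include <fast_gicp/gicp/fast_vgicp.hpp>

int main() {
  // Option 1: cap OpenMP globally, so every parallel region in this
  // process uses at most 8 threads (one per P-core on the i9-13900KF).
  omp_set_num_threads(8);

  // Option 2: cap a single registration object; the CPU variants in
  // fast_gicp expose a per-instance thread count.
  fast_gicp::FastVGICP<pcl::PointXYZ, pcl::PointXYZ> vgicp;
  vgicp.setNumThreads(8);

  // ... set inputs and call align() as usual ...
  return 0;
}
```

Since the multi-threaded variants appear to default to omp_get_max_threads(), exporting OMP_NUM_THREADS=8 before running gicp_align should have a similar effect without code changes.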

Thanks for the helpful information. I will mention this in the README.