PointCloudLibrary/pcl

[filters] The relationship between Convolution3D running speed and number of threads

Ru1yi opened this issue · 16 comments

Describe the bug
I was trying to use the Gaussian convolution filtering algorithm from the filters module. When I set the number of threads, I found that Gaussian convolution runs fastest when setNumberOfThreads is given 1, and that the more threads I use, the slower it runs. Below is my source code:

auto st = std::chrono::high_resolution_clock::now();
pcl::filters::GaussianKernel<PointT, PointT> kernel;
// Set Gaussian kernel parameters
kernel.setSigma(4);
kernel.setThresholdRelativeToSigma(4);
kernel.setThreshold(0.05);
// Create kdtree
pcl::search::KdTree<PointT>::Ptr tree(new pcl::search::KdTree<PointT>);
tree->setInputCloud(cloudptr);
// Set convolution params
pcl::filters::Convolution3D<PointT, PointT, pcl::filters::GaussianKernel<PointT, PointT>> convolution;
convolution.setKernel(kernel);
convolution.setInputCloud(cloudptr);
convolution.setNumberOfThreads(1);
convolution.setSearchMethod(tree);
convolution.setRadiusSearch(0.01);
PointCloudPtr g_filtered(new PointCloudT);
convolution.convolve(*g_filtered);
cloudptr = g_filtered;
auto et = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> duration = et - st;
qDebug() << "[Filter] Gaussian filter time: " << duration.count() * 1000 << "ms";

Test Result

| setNumberOfThreads | Average run time (ms) |
| --- | --- |
| 1 | 1350 |
| 2 | 1700 |
| 4 | 2200 |
| 8 | 3100 |
| default | 7200 |

My Environment:

  • OS: Windows 11
  • IDE: VS2019
  • Compiler: MSVC2019
  • PCL Version 1.12.1
  • Original point cloud data: 128 (width) × 1800 (height), size: 230400

Possible Solution

It looks like the time spent on thread management exceeds the time spent on the algorithm itself. So are my parameter settings wrong, or can Convolution3D not be used for organized point clouds?

@Ru1yi I did a quick test with a different cloud with GCC on Linux, and I got 5141ms - 4105ms - 2448ms - 1426ms (1 thread - 2 threads - 4 threads - 8 threads respectively).
How did you install PCL? Do you enable OpenMP while compiling your project? I don't have much experience with MSVC and OpenMP, but I do not think that thread management takes that long.

I installed PCL using PCL-1.12.1-AllInOne-msvc2019-win64.exe from the official release. If I want to use PCL with OpenMP, do I need to configure anything specific? I enabled OpenMP support in the VS project configuration.
Here are the runtimes when I turn off OpenMP support:

| setNumberOfThreads | Average run time (ms) |
| --- | --- |
| 1 | 1350 |
| 2 | 1330 |
| 4 | 1350 |
| 8 | 1380 |

It looks like OpenMP was disabled successfully, since the run time no longer depends on the thread count. Could it be a problem with my data? Can you help me confirm MSVC2019's support for OpenMP? Thanks a lot.

@Ru1yi Can you post the point cloud you are using as a zipped PCD or PLY file? I will also try to test Convolution3D with MSVC as soon as I can.

@mvieth 3644.500049090.zip Here is the data I used for testing, which was collected using the Robosense 128-line mechanical lidar. Thank you for your support.

I did some testing (VS2022, PCL 1.13.0, with 3644.100042090.pcd), but I didn't notice run time increasing with more threads. 2 threads were always faster than 1 thread. I did notice that at some point more threads did not make it any faster, maybe around 4 threads (even though my computer has 6 physical cores with 2 hyperthreads each). I only had to enable OpenMP support in one place in the project configuration (set to Yes (/openmp)). Do you build in Debug or Release configuration? Do you run with or without debugging?

Because I use Qt5 and VS2019 together for development, my PCL version is 1.12.1. Using VS2022 and PCL 1.13.0 or above would require moving to Qt6. In my previous tests, PCL 1.12.1 did not support MSVC2022, and migrating the entire program to Qt6 is a large amount of work, so I want to know whether the version affects this problem. I build this project in Debug and run with debugging.

I don't think that PCL 1.12.1 vs PCL 1.13.0 or VS2019 vs VS2022 makes any difference for multithreading performance. I would suggest also testing with your project built in the Release configuration and run without debugging. I could imagine that those two lead to more thread management overhead.

I tested the Release runtime in both debugging and non-debugging modes.

debugging mode

| setNumberOfThreads | Average run time (ms) |
| --- | --- |
| 1 | 210 |
| 2 | 180 |
| 4 | 210 |
| 8 | 300 |

non-debugging mode

| setNumberOfThreads | Average run time (ms) |
| --- | --- |
| 1 | 190 |
| 2 | 185 |
| 4 | 220 |
| 8 | 310 |

The running time is much shorter than in Debug mode, and the shortest run time is achieved with 2 threads. But I still don't see increasing the number of threads speeding up the algorithm.

@Ru1yi Okay, this is getting closer to the results I had on MSVC. At least now, using 2 threads is faster than using 1 thread, which is promising. Can you say which CPU you are testing on? (manufacturer, model number etc)

I can verify that I get similar results with VS2022 and pcl master.
I tried out Intel's VTune to profile the example code, and with 4 threads it looks like this:
(VTune profiling screenshot)

Seemingly a lot of time is spent allocating and deallocating the result vector in the flann usage: it creates a new array for its neighbors and allocates 1024 elements (floats) each time.

https://github.com/flann-lib/flann/blob/f9caaf609d8b8cb2b7104a85cf59eb92c275a25d/src/cpp/flann/algorithms/nn_index.h#L626

https://github.com/flann-lib/flann/blob/f9caaf609d8b8cb2b7104a85cf59eb92c275a25d/src/cpp/flann/util/result_set.h#L435

Also, I tried visualizing your source cloud and the result of the convolution, and they don't differ much. I think you might want to increase your radius and perhaps adjust the other parameters as well.

edit: updated the first link to point to the correct line.

Oh, and as the number of threads increases, more time is spent on allocating/deallocating and waiting 😢

Ru1yi commented

> @Ru1yi Okay, this is getting closer to the results I had on MSVC. At least now, using 2 threads is faster than using 1 thread, which is promising. Can you say which CPU you are testing on? (manufacturer, model number etc)

This is the CPU model I used for testing: Intel(R) Core(TM) i7-14700HX, 2.10 GHz

> This is the CPU model I used for testing: Intel(R) Core(TM) i7-14700HX, 2.10 GHz

That is good to know, because then your CPU is one of the new models that has "Performance-cores" and "Efficient-cores" (see https://ark.intel.com/content/www/us/en/ark/products/235997/intel-core-i7-processor-14700hx-33m-cache-up-to-5-50-ghz.html ). Basically, P-cores are faster but more power hungry while E-cores are slower but more energy efficient. In contrast, my CPU is an older one where all cores are the same.

Maybe on your computer there is some weird effect that if you run with one thread, a P-core is used, but if you run with more threads, E-cores are used.

Here is some information I found about OpenMP with P-cores/E-cores:

The consensus seems to be to choose an appropriate work schedule to keep all cores busy (a dynamic schedule). Convolution3D currently does not specify any schedule, which usually means the compiler chooses a static schedule (all threads are assigned the same number of points to process). In fact, a dynamic schedule might even be beneficial for Convolution3D on CPUs where all cores are the same (like mine), because the radius search does not take the same amount of time for each point. I will investigate and open a pull request when I have time.

Ru1yi commented

> That is good to know, because then your CPU is one of the new models that has "Performance-cores" and "Efficient-cores" [...] I will investigate and open a pull request when I have time.

Thank you very much for your help. I will continue to pay attention to this issue when I have time.

> Also, I tried visualize your source cloud and the result of the convolution which doesn't have a lot of difference. I think you might want to increase your radius and perhaps adjust the other parameters as well.

I want to emphasise this comment by Lars again: I just checked, and with the search radius from above (0.01) and the point clouds you uploaded, almost all radius searches find no neighbours, only the search point itself. Setting the search radius to 0.1 seems more appropriate, probably even higher. I am not sure what good settings for the Gaussian kernel are.