[Question] Streams and the best practice of thrust STL usage
ZhenshengLee opened this issue · 4 comments
cuda_stream is about the concurrency of kernel functions.
According to https://github.com/neka-nat/cupoch/blob/master/src/cupoch/geometry/pointcloud.h, each point cloud has three vectors: points, normals, and colors, so streams could be applied to point cloud operations.
But I found that not all functions apply this policy.
For example, passthroughFilter uses only the default stream, while the downsample function uses 3 streams to perform the operations on each vector (a sketch of that pattern follows the references below).
See:
- cupoch/src/cupoch/geometry/pointcloud.cu, line 382 in 9b4859f
- cupoch/src/cupoch/geometry/down_sample.cu, line 251 in 9b4859f
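For context, a minimal sketch of the three-streams-per-point-cloud pattern being asked about (illustrative only, not cupoch's actual code; the functor and stream handles are made up):

```cpp
// Sketch: one thrust call per vector, each attached to its own stream.
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>

struct scale_functor {
    float s;
    __host__ __device__ float3 operator()(const float3& p) const {
        return make_float3(p.x * s, p.y * s, p.z * s);
    }
};

void scale_pointcloud(thrust::device_vector<float3>& points,
                      thrust::device_vector<float3>& normals,
                      thrust::device_vector<float3>& colors, float s) {
    cudaStream_t s1, s2, s3;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaStreamCreate(&s3);
    // thrust::cuda::par.on(stream) runs each transform on its own stream.
    // Note: each synchronous thrust call still waits for its kernel to
    // finish before returning, which limits the overlap in practice.
    thrust::transform(thrust::cuda::par.on(s1), points.begin(), points.end(),
                      points.begin(), scale_functor{s});
    thrust::transform(thrust::cuda::par.on(s2), normals.begin(), normals.end(),
                      normals.begin(), scale_functor{s});
    thrust::transform(thrust::cuda::par.on(s3), colors.begin(), colors.end(),
                      colors.begin(), scale_functor{s});
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaStreamDestroy(s3);
}
```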
@neka-nat Could you explain why you chose to do it this way?
Or are there any drawbacks to using streams in CUDA?
Thanks.
Thanks!
The use of three streams in UniformDownSample is probably ineffective.
Speeding up with streams requires asynchronous processing, but thrust's copy function is synchronous.
If you change it to the default stream, the performance will not change much.
NVIDIA/thrust#827 (comment)
Switching to async::copy may speed up the process.
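As a rough sketch of that change (assuming Thrust >= 1.9.4; the vector names are illustrative), thrust::async::copy enqueues each copy and returns an event immediately, so all three transfers can be in flight before any wait:

```cpp
// Sketch: three copies in flight at once with thrust::async::copy.
#include <thrust/async/copy.h>
#include <thrust/device_vector.h>

void copy_pointcloud(const thrust::device_vector<float>& src_points,
                     const thrust::device_vector<float>& src_normals,
                     const thrust::device_vector<float>& src_colors,
                     thrust::device_vector<float>& dst_points,
                     thrust::device_vector<float>& dst_normals,
                     thrust::device_vector<float>& dst_colors) {
    // Destinations are assumed to be pre-sized to match the sources.
    // Each call enqueues a copy and returns an event without blocking.
    auto e1 = thrust::async::copy(src_points.begin(), src_points.end(),
                                  dst_points.begin());
    auto e2 = thrust::async::copy(src_normals.begin(), src_normals.end(),
                                  dst_normals.begin());
    auto e3 = thrust::async::copy(src_colors.begin(), src_colors.end(),
                                  dst_colors.begin());
    // Wait for all three transfers before touching the destinations.
    e1.wait();
    e2.wait();
    e3.wait();
}
```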
Thanks.
AFAIK, async::copy is not supported in thrust, but cudaMemcpyAsync with thrust::raw_pointer_cast would be a workaround according to NVIDIA/thrust#827 (comment), and this method is implemented in perception_cupoch in this commit: ZhenshengLee/perception_cupoch@8ba8b91
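For reference, a minimal sketch of that workaround (the element type and stream argument are illustrative):

```cpp
// Sketch: enqueue a device-to-device copy on a user-provided stream by
// exposing the device_vector storage with thrust::raw_pointer_cast.
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

void async_copy_workaround(const thrust::device_vector<float>& src,
                           thrust::device_vector<float>& dst,
                           cudaStream_t stream) {
    dst.resize(src.size());
    cudaMemcpyAsync(thrust::raw_pointer_cast(dst.data()),
                    thrust::raw_pointer_cast(src.data()),
                    src.size() * sizeof(float),
                    cudaMemcpyDeviceToDevice, stream);
    // The caller must synchronize the stream before reading dst.
}
```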
From NVIDIA/thrust#827 (comment), it seems that most thrust algorithms are blocking except for for_each; maybe std::async would help to simplify the code.
From NVIDIA/thrust#827 (comment), pinned memory would also help to get asynchronous copies, but pinned memory is not a good option for iGPU devices.
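For completeness, a sketch of the pinned-memory variant for host-to-device transfers; cudaMemcpyAsync can only overlap with other work when the host buffer is page-locked, which is why pinned memory matters here (and, as noted above, it helps little on iGPU devices such as Tegra, where host and device share physical memory):

```cpp
// Sketch: page-locked (pinned) host memory enables truly asynchronous
// host-to-device copies with cudaMemcpyAsync.
#include <cuda_runtime.h>

void pinned_upload(float* d_dst, size_t n, cudaStream_t stream) {
    float* h_src = nullptr;
    cudaMallocHost(&h_src, n * sizeof(float));  // pinned allocation
    // ... fill h_src on the host ...
    cudaMemcpyAsync(d_dst, h_src, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);  // finish before freeing the pinned buffer
    cudaFreeHost(h_src);
}
```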
I used async::copy to test whether async is enabled.
The implementation using async::copy is in the latest master.
The nvvp results show that the async copies are working well.
The calculation time was also faster using async::copy.
```python
import time

import cupoch as cph

if __name__ == "__main__":
    print("Load a ply point cloud, print it, and render it")
    pcd = cph.io.read_point_cloud("../../testdata/icp/cloud_bin_2.pcd")
    cph.visualization.draw_geometries([pcd])
    # Time 100 runs of uniform_down_sample.
    start = time.time()
    for _ in range(100):
        uni_down_pcd = pcd.uniform_down_sample(every_k_points=5)
    print(time.time() - start)
    cph.visualization.draw_geometries([uni_down_pcd])
    # sync: 0.025475502014160156
    # async: 0.018369436264038086
```
I have checked https://github.com/NVIDIA/thrust/blob/main/CHANGELOG.md: async::copy and the other async algorithms have been available since Thrust 1.9.4 (CUDA Toolkit 10.1).
This policy can easily be applied to all code that uses these algorithms (see the sketch after the list below):
- thrust::async::reduce.
- thrust::async::reduce_into, which takes a target location to store the reduction result into.
- thrust::async::copy, including a two-policy overload that allows explicit cross-system copies to which execution policy properties can be attached.
- thrust::async::transform.
- thrust::async::for_each.
- thrust::async::stable_sort.
- thrust::async::sort.
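As a small sketch of the common pattern these share (assuming Thrust >= 1.9.4), each async algorithm enqueues work and returns a future or event that is waited on later:

```cpp
// Sketch: thrust::async::reduce returns a future instead of blocking.
#include <thrust/async/reduce.h>
#include <thrust/device_vector.h>

float sum_async(const thrust::device_vector<float>& v) {
    // Enqueue the reduction; the call returns immediately.
    auto fut = thrust::async::reduce(v.begin(), v.end(), 0.0f);
    // ... other host work or further enqueues can happen here ...
    return fut.get();  // blocks until the result is ready
}
```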
Great! For JetPack 4.4.1, the Thrust version is 1.9.7-1 (CUDA Toolkit 10.2 for Tegra).