neka-nat/cupoch

[Question] Streams and best practices for thrust STL usage.

ZhenshengLee opened this issue · 4 comments

CUDA streams are about the concurrency of kernel functions.

According to https://github.com/neka-nat/cupoch/blob/master/src/cupoch/geometry/pointcloud.h, each point cloud has three vectors: points, normals, and colors. Streams could therefore be applied per vector in a point cloud.

But I found that not all functions apply this policy.

For example, PassThroughFilter uses only the default stream, while UniformDownSample uses three streams to perform the operations on each vector.

See:

std::shared_ptr<PointCloud> PointCloud::PassThroughFilter(int axis_no,

std::shared_ptr<PointCloud> PointCloud::UniformDownSample(
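
For illustration, here is a minimal sketch of the per-stream pattern in question (not cupoch's actual code; the vector types and names are placeholders), using thrust::cuda::par.on to attach each call to its own stream:

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cuda_runtime.h>

// Illustrative only: copy points, normals and colors with one stream each.
void copy_pointcloud_on_streams(const thrust::device_vector<float3>& points,
                                const thrust::device_vector<float3>& normals,
                                const thrust::device_vector<float3>& colors,
                                thrust::device_vector<float3>& out_points,
                                thrust::device_vector<float3>& out_normals,
                                thrust::device_vector<float3>& out_colors) {
    cudaStream_t s1, s2, s3;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaStreamCreate(&s3);

    // Each copy is submitted with its own stream, but note that thrust::copy
    // itself is a blocking call, so this alone does not overlap the copies.
    thrust::copy(thrust::cuda::par.on(s1), points.begin(), points.end(), out_points.begin());
    thrust::copy(thrust::cuda::par.on(s2), normals.begin(), normals.end(), out_normals.begin());
    thrust::copy(thrust::cuda::par.on(s3), colors.begin(), colors.end(), out_colors.begin());

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamSynchronize(s3);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaStreamDestroy(s3);
}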

@neka-nat Could you explain the reason why you chose to do so?

Or are there any drawbacks to using streams in CUDA?

Thanks.

Thanks!
The use of three streams in UniformDownSample is probably ineffective.
Speeding up with streams requires asynchronous processing, but thrust's copy function is synchronous.
If you change it to the default stream, the performance will not change much.
NVIDIA/thrust#827 (comment)

Switching to async::copy may speed up the process.
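
For reference, a minimal sketch of what the async::copy version could look like (illustrative, not the actual cupoch code): thrust::async::copy returns an event, so the three copies can be in flight before the host waits on them.

#include <thrust/device_vector.h>
#include <thrust/async/copy.h>

// Sketch: launch three copies without blocking the host, then wait for all
// of them. The vector names are placeholders, not cupoch's actual members.
void async_copy_example(const thrust::device_vector<float>& points,
                        const thrust::device_vector<float>& normals,
                        const thrust::device_vector<float>& colors,
                        thrust::device_vector<float>& out_points,
                        thrust::device_vector<float>& out_normals,
                        thrust::device_vector<float>& out_colors) {
    auto e1 = thrust::async::copy(points.begin(), points.end(), out_points.begin());
    auto e2 = thrust::async::copy(normals.begin(), normals.end(), out_normals.begin());
    auto e3 = thrust::async::copy(colors.begin(), colors.end(), out_colors.begin());
    e1.wait();
    e2.wait();
    e3.wait();
}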

Thanks.

AFAIK, async::copy is not supported in thrust, but cudaMemcpyAsync with thrust::raw_pointer_cast would be a workaround according to NVIDIA/thrust#827 (comment), and this method is implemented in perception_cupoch with commit ZhenshengLee/perception_cupoch@8ba8b91.
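
For illustration, the cudaMemcpyAsync workaround might look roughly like this (a sketch for a device-to-device copy between thrust::device_vectors, not the exact perception_cupoch code):

#include <thrust/device_vector.h>
#include <cuda_runtime.h>

// Sketch: copy one device_vector into another asynchronously on a given
// stream by going through raw pointers.
void raw_async_copy(const thrust::device_vector<float>& src,
                    thrust::device_vector<float>& dst,
                    cudaStream_t stream) {
    cudaMemcpyAsync(thrust::raw_pointer_cast(dst.data()),
                    thrust::raw_pointer_cast(src.data()),
                    src.size() * sizeof(float),
                    cudaMemcpyDeviceToDevice,
                    stream);
    // The result is only valid after the stream has been synchronized,
    // e.g. cudaStreamSynchronize(stream).
}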

From NVIDIA/thrust#827 (comment), it seems that most thrust algorithms are blocking except for for_each; maybe std::async would help to simplify the code.
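
As a rough sketch of that idea (hedged, not something cupoch actually does), a blocking thrust algorithm could be pushed onto a worker thread with std::async so the calling thread stays free:

#include <future>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

// Sketch: run a blocking thrust::sort on a background host thread.
// Whether this overlaps useful GPU work depends on which streams the
// concurrent calls end up using.
void sort_in_background(thrust::device_vector<float>& values) {
    auto task = std::async(std::launch::async, [&values] {
        thrust::sort(values.begin(), values.end());
    });
    // ... other host-side work could run here ...
    task.wait();  // block until the sort has completed
}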

From NVIDIA/thrust#827 (comment), pinned memory would also help to get async copies, but pinned memory is not a good option for iGPU devices.
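
For completeness, a hedged sketch of the pinned-memory variant: a host staging buffer allocated with cudaMallocHost (page-locked memory) is what allows cudaMemcpyAsync to actually overlap with other work on a dGPU, which is less useful on iGPUs with unified memory.

#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Sketch: upload host data to the device asynchronously through a pinned
// (page-locked) staging buffer.
void pinned_upload(const float* host_src, float* device_dst,
                   std::size_t n, cudaStream_t stream) {
    float* pinned = nullptr;
    cudaMallocHost(&pinned, n * sizeof(float));      // page-locked host memory
    std::copy(host_src, host_src + n, pinned);       // stage into the pinned buffer
    cudaMemcpyAsync(device_dst, pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream); // asynchronous H2D copy
    cudaStreamSynchronize(stream);                   // wait before freeing the buffer
    cudaFreeHost(pinned);
}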

I used async::copy to test whether asynchronous execution is enabled.
The implementation using async::copy is in the latest master.
The nvvp results show that the asynchronous execution is working well.

Using thrust::copy (nvvp timeline: sync_copy)

Using thrust::async::copy (nvvp timeline: async_copy)

The calculation was also faster using async::copy.

import time
import cupoch as cph

if __name__ == "__main__":
    print("Load a ply point cloud, print it, and render it")
    pcd = cph.io.read_point_cloud("../../testdata/icp/cloud_bin_2.pcd")
    cph.visualization.draw_geometries([pcd])

    start = time.time()
    for _ in range(100):
        uni_down_pcd = pcd.uniform_down_sample(every_k_points=5)
    print(time.time() - start)
    cph.visualization.draw_geometries([uni_down_pcd])
    # sync: 0.025475502014160156
    # async: 0.018369436264038086

I have checked https://github.com/NVIDIA/thrust/blob/main/CHANGELOG.md:
async::copy and the other async algorithms have been available since Thrust 1.9.4 (CUDA Toolkit 10.1).

This policy can easily be applied to all code that uses these algorithms (see the sketch after the list below):

thrust::async::reduce.
thrust::async::reduce_into, which takes a target location to store the reduction result into.
thrust::async::copy, including a two-policy overload that allows explicit cross-system copies to which execution policy properties can be attached.
thrust::async::transform.
thrust::async::for_each.
thrust::async::stable_sort.
thrust::async::sort.
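
As a hedged illustration of that pattern (not cupoch's actual code), a couple of the listed algorithms could be chained by waiting on the returned events/futures:

#include <thrust/device_vector.h>
#include <thrust/async/transform.h>
#include <thrust/async/reduce.h>
#include <thrust/functional.h>
#include <iostream>

int main() {
    thrust::device_vector<float> src(1 << 20, 1.0f);
    thrust::device_vector<float> dst(src.size());

    // Launch the transform without blocking the host; the returned event
    // can be waited on later.
    auto transform_done = thrust::async::transform(
        src.begin(), src.end(), dst.begin(), thrust::negate<float>());

    // ... other host-side work could overlap here ...

    transform_done.wait();

    // async::reduce returns a future; get() waits and yields the result.
    auto sum = thrust::async::reduce(dst.begin(), dst.end());
    std::cout << sum.get() << std::endl;
    return 0;
}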

Great! For JetPack 4.4.1, the Thrust version is 1.9.7-1 (CUDA Toolkit 10.2 for Tegra).