neka-nat/cupoch

[QST]About the best practice of memory copy.

ZhenshengLee opened this issue · 7 comments

Hi everyone,

I am trying to find the best way to copy points between host and device, and I found several methods in cupoch.
So which method has the best performance (in time consumed)?

thrust copy constructor / assignment

may not be the best

// d2h
auto pointcloud_points_host = pointcloud->GetPoints();
// h2d
void PointCloud::SetPoints(const thrust::host_vector<Eigen::Vector3f> &points) {
    points_ = points;
}

cudaMemcpy with thrust raw pointers

in src/cupoch/io/class_io/pointcloud_io.cu

void HostPointCloud::FromDevice(const geometry::PointCloud& pointcloud) {
    points_.resize(pointcloud.points_.size());
    normals_.resize(pointcloud.normals_.size());
    colors_.resize(pointcloud.colors_.size());
    cudaSafeCall(cudaMemcpy(points_.data(), thrust::raw_pointer_cast(pointcloud.points_.data()),
                            points_.size() * sizeof(Eigen::Vector3f), cudaMemcpyDeviceToHost));
    cudaSafeCall(cudaMemcpy(normals_.data(), thrust::raw_pointer_cast(pointcloud.normals_.data()),
                            normals_.size() * sizeof(Eigen::Vector3f), cudaMemcpyDeviceToHost));
    cudaSafeCall(cudaMemcpy(colors_.data(), thrust::raw_pointer_cast(pointcloud.colors_.data()),
                            colors_.size() * sizeof(Eigen::Vector3f), cudaMemcpyDeviceToHost));
}

void HostPointCloud::ToDevice(geometry::PointCloud& pointcloud) const {
    pointcloud.points_.resize(points_.size());
    pointcloud.normals_.resize(normals_.size());
    pointcloud.colors_.resize(colors_.size());
    cudaSafeCall(cudaMemcpy(thrust::raw_pointer_cast(pointcloud.points_.data()), points_.data(),
                            points_.size() * sizeof(Eigen::Vector3f), cudaMemcpyHostToDevice));
    cudaSafeCall(cudaMemcpy(thrust::raw_pointer_cast(pointcloud.normals_.data()), normals_.data(),
                            normals_.size() * sizeof(Eigen::Vector3f), cudaMemcpyHostToDevice));
    cudaSafeCall(cudaMemcpy(thrust::raw_pointer_cast(pointcloud.colors_.data()), colors_.data(),
                            colors_.size() * sizeof(Eigen::Vector3f), cudaMemcpyHostToDevice));
}

thrust::copy (even with CUDA streams)

src/cupoch/geometry/down_sample.cu

    thrust::copy(utility::exec_policy(utility::GetStream(0))
                         ->on(utility::GetStream(0)),
                 range_points.begin(), range_points.end(),
                 output->points_.begin());
    if (has_normals) {
        thrust::strided_range<
                utility::device_vector<Eigen::Vector3f>::const_iterator>
                range_normals(normals_.begin(), normals_.end(), every_k_points);
        thrust::copy(utility::exec_policy(utility::GetStream(1))
                             ->on(utility::GetStream(1)),
                     range_normals.begin(), range_normals.end(),
                     output->normals_.begin());
    }
    if (has_colors) {
        thrust::strided_range<
                utility::device_vector<Eigen::Vector3f>::const_iterator>
                range_colors(colors_.begin(), colors_.end(), every_k_points);
        thrust::copy(utility::exec_policy(utility::GetStream(2))
                             ->on(utility::GetStream(2)),
                     range_colors.begin(), range_colors.end(),
                     output->colors_.begin());
    }
    cudaSafeCall(cudaDeviceSynchronize());
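
To compare these three approaches head to head, a minimal standalone timing sketch could look like the following. This is my own toy benchmark, not cupoch code; float3 stands in for Eigen::Vector3f and the sizes are arbitrary.

// Toy D2H benchmark comparing the three approaches above.
// Standalone sketch, not cupoch code; float3 stands in for Eigen::Vector3f.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

int main() {
    const size_t n = 1 << 20;  // ~1M points
    thrust::device_vector<float3> d_points(n);
    thrust::host_vector<float3> h_points(n);

    // Run a lambda, wait for the GPU, and return the elapsed wall-clock time.
    auto time_ms = [](auto&& f) {
        auto t0 = std::chrono::high_resolution_clock::now();
        f();
        cudaDeviceSynchronize();
        return std::chrono::duration<double, std::milli>(
                       std::chrono::high_resolution_clock::now() - t0).count();
    };

    time_ms([&] { h_points = d_points; });  // warm-up, excludes first-touch cost

    printf("thrust assignment: %.3f ms\n",
           time_ms([&] { h_points = d_points; }));
    printf("thrust::copy:      %.3f ms\n",
           time_ms([&] { thrust::copy(d_points.begin(), d_points.end(),
                                      h_points.begin()); }));
    printf("cudaMemcpy:        %.3f ms\n",
           time_ms([&] { cudaMemcpy(h_points.data(),
                                    thrust::raw_pointer_cast(d_points.data()),
                                    n * sizeof(float3),
                                    cudaMemcpyDeviceToHost); }));
    return 0;
}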

This is a Japanese article, and it says that cudaMemcpy is faster.
http://www.sciement.com/tech-blog/c/various_binarizations2/

I cannot visit that Japanese website, but it doesn't matter.

My experiment shows that cudaMemcpy is faster.

I also tried cudaMemcpyAsync with 3 streams to copy points, normals, and colors, and the performance was no better than cudaMemcpy. That bothers me a lot.

The copy between host and device is synchronized on the host, so splitting it across separate streams does not make it faster.
It is possible to speed things up by starting to process the data that has already been copied, i.e. overlapping transfers with computation.
https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/
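
A minimal sketch of that overlap pattern from the linked post, assuming pinned host memory and a placeholder kernel (my own sketch, not cupoch code):

// Overlap H2D copies with processing by chunking the data across streams:
// chunk s+1 can copy while chunk s is being processed.
// Pinned host memory is required for cudaMemcpyAsync to be truly asynchronous.
#include <cuda_runtime.h>

__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // placeholder work
}

int main() {
    const int n = 1 << 22, n_streams = 4, chunk = n / n_streams;
    float *h_data, *d_data;
    cudaMallocHost(&h_data, n * sizeof(float));  // pinned (page-locked) host buffer
    cudaMalloc(&d_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t streams[n_streams];
    for (int s = 0; s < n_streams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < n_streams; ++s) {
        const int offset = s * chunk;
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, chunk);
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < n_streams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}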

Thanks for quick reply!

The project cuda-pcl (only for Jetson devices) shows its memory copy method, which uses managed memory with a CUDA stream.

cudaMallocManaged(&input, sizeof(float) * 4 * nCount, cudaMemAttachHost);
cudaStreamAttachMemAsync(stream, input);
cudaMemcpyAsync(input, inputData, sizeof(float) * 4 * nCount, cudaMemcpyHostToDevice, stream);
cudaStreamSynchronize(stream);

as #62 and #60 said,

the pass-filter time consumption using cupoch is 0.686242ms (passfilter only)

But with cuda-pcl it is 0.586943ms (with passfilter and the device-to-host copy)

I think more work is needed to improve the core performance of cupoch, including memory copy and the thrust STL functions.

I am making an effort to do so.

Thank you!

Sounds good!
In the case of Jetson, it is faster to use managed memory.
Currently, it is possible to use managed memory in cupoch, but it requires additional implementation around copying.

It involves not only the implementation around copying, but also allocation.

Because managed memory is allocated with cudaMallocManaged, you must use rmm::mr::thrust_allocator with the rmm::mr::managed_memory_resource so that it can be used with the point cloud class.
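
Roughly, the wiring could look like the sketch below. This is my own sketch against the RMM API; the header paths and whether a default-constructed thrust_allocator picks up the current device resource depend on the RMM version, so treat the details as assumptions. It also assumes Eigen is usable in device code, as cupoch already requires.

// Sketch: back thrust device containers with CUDA managed memory through RMM.
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/thrust_allocator.hpp>
#include <thrust/device_vector.h>
#include <Eigen/Core>

// Device vector whose storage comes from the current RMM device resource,
// analogous to cupoch's utility::device_vector.
template <typename T>
using managed_vector = thrust::device_vector<T, rmm::mr::thrust_allocator<T>>;

int main() {
    // Route RMM allocations through cudaMallocManaged.
    rmm::mr::managed_memory_resource managed_mr;
    rmm::mr::set_current_device_resource(&managed_mr);

    // This buffer now lives in managed memory, so on Jetson the host can
    // touch it directly instead of copying.
    managed_vector<Eigen::Vector3f> points(100000);
    return 0;
}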

In my test, the memcpy speeds up, but the thrust STL (e.g. remove_if for the passthrough filter) becomes slower.

// use managed memory allocation to speed up memcpy
utility::InitializeAllocator(utility::CudaManagedMemory, 1000000000);
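
For reference, the "thrust STL" step mentioned above is essentially a remove_if pass over the point buffer. A toy pass-through filter might look like this (my own sketch, not cupoch's actual implementation; it assumes Eigen is usable in device code, as cupoch already requires):

// Toy pass-through filter: keep points whose z lies in [z_min, z_max].
#include <thrust/device_vector.h>
#include <thrust/remove.h>
#include <Eigen/Core>

struct outside_z_range {
    float z_min, z_max;
    __host__ __device__ bool operator()(const Eigen::Vector3f& p) const {
        return p.z() < z_min || p.z() > z_max;
    }
};

void pass_through_z(thrust::device_vector<Eigen::Vector3f>& points,
                    float z_min, float z_max) {
    // remove_if compacts the kept points in place; this is the kind of call
    // that was observed to run slower on managed memory than on device memory.
    auto new_end = thrust::remove_if(points.begin(), points.end(),
                                     outside_z_range{z_min, z_max});
    points.resize(new_end - points.begin());
}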

as #62 and #60 said,

  • x86 PC with Quadro P4000 GPU
  • PoolAllocation strategy
  • cuda-filter-cupoch context
  • filter 119978 points down to 5510.

The passfilter function costs 0.096551ms

Originally posted by @ZhenshengLee in #60 (comment)

But with the CudaManagedMemory strategy, passfilter costs 0.298ms

So there must be some issues with thrust STL with managed memory.

@neka-nat

The test on the Xavier shows that it would be better to use the default memory pool rather than the managed memory pool in cupoch.

According to rapidsai/rmm#849

I think you are expecting managed memory access to have higher performance than device memory access on Jetson devices. RMM is a memory allocator -- the benchmark you are running measures allocation performance, not memory access performance. I would expect managed memory allocation to be similar to or slower than device memory allocation, which this benchmark confirms.

On Jetson, I would expect using managed memory to eliminate the need for memory copies, but not to make memory access faster. Jetson has a single physical memory shared by GPU and CPU. Eliminating copies is likely to be beneficial.

Three parts of the time consumption will change when using a custom memory allocator (see the sketch after this list):

  • allocation time
  • memory copy (access) time
  • algorithm (kernel) execution time
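
A simple way to see how an allocator change shifts these three numbers is to time each part separately. A standalone sketch with a placeholder kernel (my own toy code, not cupoch):

// Time allocation, H2D copy, and kernel execution separately.
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;  // placeholder work
}

static double ms_since(std::chrono::high_resolution_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
                   std::chrono::high_resolution_clock::now() - t0).count();
}

int main() {
    const int n = 1 << 22;
    std::vector<float> h(n, 1.0f);
    float* d = nullptr;
    cudaFree(0);  // create the CUDA context so it is not billed to the first cudaMalloc

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMalloc(&d, n * sizeof(float));             // 1. allocation time
    printf("alloc:  %.3f ms\n", ms_since(t0));

    auto t1 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(d, h.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);            // 2. memory copy (access) time
    printf("copy:   %.3f ms\n", ms_since(t1));

    auto t2 = std::chrono::high_resolution_clock::now();
    dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);  // 3. kernel execution time
    cudaDeviceSynchronize();
    printf("kernel: %.3f ms\n", ms_since(t2));

    cudaFree(d);
    return 0;
}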

With managed memory allocation on Jetson devices with an iGPU, the memory access time is reduced a lot (see ZhenshengLee/perception_cupoch#9), but the allocation time and kernel execution time increase a lot too; see this doc.

So it can be quite complicated to provide a single, unified way to get the best performance.