Possible performance improvement for Jetson Nano by using zero copy
On the Jetson Nano, CPU memory and GPU memory share the same physical SoC DRAM.
Instead of using cudaMemcpyAsync, utilizing zero copy might enhance performance.
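For illustration, here is a minimal sketch of what zero copy looks like with the CUDA runtime API. It is not the OCTproZ code: `convertKernel`, `runZeroCopyDemo` and the buffer names are made up for this example. The idea is to allocate mapped pinned host memory with `cudaHostAlloc(..., cudaHostAllocMapped)`, get the matching device pointer with `cudaHostGetDevicePointer`, and let the kernel read and write that memory directly, so no `cudaMemcpyAsync` is needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Leave the enclosing function with a non-zero code on any CUDA error.
// (Sketch only: buffers allocated earlier are not freed on these error paths.)
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            return 1;                                              \
        }                                                          \
    } while (0)

// Hypothetical stand-in for a processing step: converts raw samples to float.
__global__ void convertKernel(const unsigned short* in, float* out, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scale * static_cast<float>(in[i]);
}

// Returns 0 if the kernel output read back through the mapped buffer is correct.
int runZeroCopyDemo(int n) {
    // Request support for mapped pinned memory. Older CUDA versions reject this
    // call once a context exists, so the result is cleared and ignored here;
    // the allocation and pointer lookup below report any real problem.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    // Pinned host buffers that are also mapped into the GPU address space.
    unsigned short* hostIn = nullptr;
    float* hostOut = nullptr;
    CUDA_CHECK(cudaHostAlloc((void**)&hostIn, n * sizeof(unsigned short), cudaHostAllocMapped));
    CUDA_CHECK(cudaHostAlloc((void**)&hostOut, n * sizeof(float), cudaHostAllocMapped));

    for (int i = 0; i < n; ++i) hostIn[i] = static_cast<unsigned short>(i % 4096);  // 12-bit test values

    // Device-side views of the same physical memory: no cudaMemcpyAsync needed.
    unsigned short* devIn = nullptr;
    float* devOut = nullptr;
    CUDA_CHECK(cudaHostGetDevicePointer((void**)&devIn, hostIn, 0));
    CUDA_CHECK(cudaHostGetDevicePointer((void**)&devOut, hostOut, 0));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    convertKernel<<<blocks, threads>>>(devIn, devOut, n, 0.5f);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());  // results are visible to the host after this

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (hostOut[i] != 0.5f * static_cast<float>(hostIn[i])) ++mismatches;
    }

    CUDA_CHECK(cudaFreeHost(hostIn));
    CUDA_CHECK(cudaFreeHost(hostOut));
    return mismatches == 0 ? 0 : 1;
}

int main() {
    int rc = runZeroCopyDemo(1024 * 512);
    std::puts(rc == 0 ? "zero-copy result verified" : "zero-copy demo failed");
    return rc;
}
```

This should build with something like `nvcc zero_copy_demo.cu -o zero_copy_demo` (the file name is arbitrary). Whether this access pattern is actually faster than an explicit copy depends on the device and on how the kernels touch the data, so it needs to be measured.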
See:
Zero-copy appears to slightly improve performance.
(This version of the code was used: https://github.com/spectralcode/OCTproZ/tree/98656b84144ad83c9a6033c3ed2d936d0a9d683e with the virtual OCT system: 12 bit, 1024 samples per raw A-scan, 512 A-scans per B-scan, 32 B-scans per buffer)
Left: zero-copy enabled | Right: zero-copy disabled
Zoomed in (left: zero-copy enabled, right: disabled):
To make this work, I moved the synchronization event to the end of the processing pipeline. Otherwise, the visual profiler diagram looks scattered, and the OCT images flicker. This should be investigated in more detail at some point in the future. Perhaps synchronization with the host could be improved by using cudaLaunchHostFunc to tell the host that the previous buffer was copied/used and can be reused. For now, this works, and I’ll merge it into the main branch.
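A rough sketch of how the cudaLaunchHostFunc idea could look, untested in OCTproZ itself. `BufferSlot`, `markBufferFree`, `consumeKernel` and `runHostFuncDemo` are hypothetical names for this example. The host function is enqueued in the stream right after the kernel that reads the buffer, so it runs only once that kernel has finished, and it flips a flag that an acquisition thread could poll before refilling the buffer. The host function itself must not make CUDA API calls.

```cuda
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            return 1;                                              \
        }                                                          \
    } while (0)

// Hypothetical bookkeeping for one acquisition buffer.
struct BufferSlot {
    float* data = nullptr;            // mapped pinned host memory
    std::atomic<bool> inUse{false};   // true while the GPU may still read it
};

// Runs on a CUDA-internal host thread after all earlier work in the stream.
// CUDA API calls are not allowed in here, so it only updates the flag.
void CUDART_CB markBufferFree(void* userData) {
    static_cast<BufferSlot*>(userData)->inUse.store(false, std::memory_order_release);
}

// Hypothetical processing step that reads the acquisition buffer.
__global__ void consumeKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Returns 0 if the host function released the buffer after the kernel ran.
int runHostFuncDemo(int n) {
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    BufferSlot slot;
    CUDA_CHECK(cudaHostAlloc((void**)&slot.data, n * sizeof(float), cudaHostAllocMapped));
    for (int i = 0; i < n; ++i) slot.data[i] = static_cast<float>(i);

    float* devIn = nullptr;
    float* devOut = nullptr;
    CUDA_CHECK(cudaHostGetDevicePointer((void**)&devIn, slot.data, 0));
    CUDA_CHECK(cudaMalloc((void**)&devOut, n * sizeof(float)));

    // Mark the buffer as busy, queue the kernel, then queue the release.
    slot.inUse.store(true, std::memory_order_release);
    const int threads = 256;
    consumeKernel<<<(n + threads - 1) / threads, threads, 0, stream>>>(devIn, devOut, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaLaunchHostFunc(stream, markBufferFree, &slot));

    // A real acquisition thread would poll slot.inUse instead of blocking here.
    CUDA_CHECK(cudaStreamSynchronize(stream));
    const bool released = !slot.inUse.load(std::memory_order_acquire);

    CUDA_CHECK(cudaFree(devOut));
    CUDA_CHECK(cudaFreeHost(slot.data));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return released ? 0 : 1;
}

int main() {
    int rc = runHostFuncDemo(1 << 16);
    std::puts(rc == 0 ? "buffer released by host function" : "host function demo failed");
    return rc;
}
```

Compared to a synchronization event at the end of the pipeline, this would release each buffer as soon as the last kernel that reads it has completed. Whether that removes the flickering is exactly the part that still needs to be investigated.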
The visual profiler files can be downloaded here for closer inspection:
20241103_nano_visual_profiler_zero_copy.zip