spectralcode/OCTproZ

Possible performance improvement for Jetson Nano by using zero copy

Closed · 1 comment

On the Jetson Nano, CPU memory and GPU memory share the same physical SoC DRAM.
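Instead of copying buffers with cudaMemcpyAsync, using zero-copy (mapped pinned host memory that the GPU accesses directly) might improve performance, because it avoids a redundant copy within the same DRAM.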
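As a rough illustration of the idea, here is a minimal zero-copy sketch. It is not OCTproZ code: the kernel `scaleSamples`, the function `zeroCopyDemo`, and the buffer sizes are hypothetical, and it assumes the device supports mapped pinned memory. Only standard CUDA runtime calls are used (`cudaHostAlloc` with `cudaHostAllocMapped`, `cudaHostGetDevicePointer`).

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical stand-in for one processing step: convert raw 16-bit samples to float.
__global__ void scaleSamples(const unsigned short* in, float* out, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * scale;
}

// Fills a mapped host buffer, processes it on the GPU without any cudaMemcpy,
// and returns true if the values read back on the host match the expectation.
bool zeroCopyDemo(int n) {
    // Request support for mapped pinned allocations. The result is ignored
    // because the call may report an error if a context is already active.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();  // clear a possible error from the call above

    unsigned short* hIn = nullptr;
    float* hOut = nullptr;
    if (cudaHostAlloc((void**)&hIn, n * sizeof(unsigned short), cudaHostAllocMapped) != cudaSuccess)
        return false;
    if (cudaHostAlloc((void**)&hOut, n * sizeof(float), cudaHostAllocMapped) != cudaSuccess) {
        cudaFreeHost(hIn);
        return false;
    }

    // The host writes raw data directly into the mapped buffer (12-bit value range).
    for (int i = 0; i < n; ++i) hIn[i] = (unsigned short)(i % 4096);

    // Device-side aliases of the same physical memory; no copy is made.
    unsigned short* dIn = nullptr;
    float* dOut = nullptr;
    cudaHostGetDevicePointer((void**)&dIn, hIn, 0);
    cudaHostGetDevicePointer((void**)&dOut, hOut, 0);

    const float scale = 1.0f / 4095.0f;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scaleSamples<<<blocks, threads>>>(dIn, dOut, n, scale);

    // The host must wait for the kernel before reading the output buffer.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) {
        if (fabsf(hOut[i] - hIn[i] * scale) > 1e-6f) ok = false;
    }

    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return ok;
}
```

Calling `zeroCopyDemo(1 << 20)` from a `main` function should return `true` on a device that supports mapped host memory. Whether this is faster than an explicit copy depends on the access pattern and the platform, so it is worth profiling both variants.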

Zero-copy appears to slightly improve performance.

(This version of the code was used: https://github.com/spectralcode/OCTproZ/tree/98656b84144ad83c9a6033c3ed2d936d0a9d683e with the virtual OCT system set to 12-bit, 1024 samples per raw A-scan, 512 A-scans per B-scan, and 32 B-scans per buffer.)

Left: zero-copy enabled | Right: zero-copy disabled

[Image: jetson_nano_zero_copy_enabled_disabled_visual_profiler]

Zoomed in (left: zero-copy enabled, right: disabled):

[Image: jetson_nano_zero_copy_enabled_disabled_visual_profiler_zoomed]

To make this work, I moved the synchronization event to the end of the processing pipeline. Otherwise, the visual profiler diagram looks scattered, and the OCT images flicker. This should be investigated in more detail at some point in the future. Perhaps synchronization with the host could be improved by using cudaLaunchHostFunc to tell the host that the previous buffer was copied/used and can be reused. For now, this works, and I’ll merge it into the main branch.
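The cudaLaunchHostFunc idea could look roughly like the sketch below. This is only an assumption about how such a notification might be wired up, not what OCTproZ does: `BufferSlot`, `markBufferFree`, `consumeBuffer`, and `hostFuncDemo` are hypothetical names. The host function is enqueued in the stream after the kernel that reads the buffer, so it runs once the GPU is done with that buffer. Note that a host function launched this way must not call CUDA API functions itself.

```cuda
#include <atomic>
#include <cuda_runtime.h>

// Hypothetical bookkeeping for one acquisition buffer.
struct BufferSlot {
    std::atomic<bool> free{true};
};

// Runs on a CUDA-internal host thread once all preceding work in the stream
// has finished. It only sets a flag and makes no CUDA calls.
void CUDART_CB markBufferFree(void* userData) {
    static_cast<BufferSlot*>(userData)->free.store(true, std::memory_order_release);
}

// Hypothetical stand-in for the processing pipeline that reads the raw buffer.
__global__ void consumeBuffer(const unsigned short* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)in[i];
}

// Returns true if the slot was reported free after the stream finished.
bool hostFuncDemo(int n) {
    BufferSlot slot;
    unsigned short* hIn = nullptr;
    unsigned short* dIn = nullptr;
    float* dOut = nullptr;
    cudaStream_t stream;

    if (cudaHostAlloc((void**)&hIn, n * sizeof(unsigned short), cudaHostAllocMapped) != cudaSuccess)
        return false;
    for (int i = 0; i < n; ++i) hIn[i] = (unsigned short)(i % 4096);
    cudaHostGetDevicePointer((void**)&dIn, hIn, 0);
    cudaMalloc((void**)&dOut, n * sizeof(float));
    cudaStreamCreate(&stream);

    // The buffer is in use from now until the callback fires.
    slot.free.store(false, std::memory_order_release);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    consumeBuffer<<<blocks, threads, 0, stream>>>(dIn, dOut, n);

    // Enqueued after the kernel: executes when the GPU no longer needs hIn.
    cudaLaunchHostFunc(stream, markBufferFree, &slot);

    // A real acquisition loop would poll slot.free instead of blocking here.
    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);
    ok = ok && slot.free.load(std::memory_order_acquire);

    cudaStreamDestroy(stream);
    cudaFree(dOut);
    cudaFreeHost(hIn);
    return ok;
}
```

In an acquisition loop, the producer would check `slot.free` before overwriting a buffer, which avoids blocking the host on a device-wide or stream-wide synchronization.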

The visual profiler files can be downloaded here for closer inspection:
20241103_nano_visual_profiler_zero_copy.zip