NVIDIA/nvbench

Add workaround for corrupted nsys GPU utilization data

Closed this issue · 3 comments

libcudf benchmarks run using nvbench show a conflict with Nsight Systems when collecting GPU utilization data. This issue is tracked in the Nsight Systems Jira board (Slack thread, Jira Issue).

The current consensus is that the root cause lies in how Nsight Systems collects utilization data. I'm opening this issue to request that nvbench investigate a workaround. C++ Google Benchmark and Python pytest benchmarks have no issues collecting GPU utilization data with Nsight Systems, so there must be some way for nvbench users running with the --profile flag to access GPU utilization data.

Reference profile with nvbench:
image

Reference profile with gbench:
image

bdice commented

The corresponding JIRA issue has been closed. @GregoryKimball Can this be closed?

Thank you @bdice for checking in. I'm sorry to say that there is still a problem here at the intersection of nsys and nvbench.

When running this command on RAPIDS devel image 9ddc9c4c2046:

B=JOIN_NVBENCH && /nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --gpu-metrics-device=0 --output=/nfs/20230113_mixed_join/"$B" cpp/build/benchmarks/"$B" --devices 0 --profile --json /nfs/20230113_mixed_join/"$B".json | tee /nfs/20230113_mixed_join/"$B".txt

we receive this nsys diagnostics error instead of valid GPU utilization metrics:

Error when processing events: Source ID=
Type=ErrorInformation (18)
 Error information:
 ProcessEventsError (4005)
  Properties:
  ErrorText (100)=/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/Analysis/EventHandler/GpuMetricsEventHandler.cpp(202): Throw in function void QuadDAnalysis::EventHandler::GpuMetricsEventHandler::PutEvent(QuadDAnalysis::EventHandler::GpuMetricsEventHandler::EventPtr)
Dynamic exception type: boost::wrapexcept
std::exception::what: ChronologicalOrderError
[QuadDCommon::tag_message*] = GPU Metrics event chronological order was broken.

...

Error	Daemon	00:32.551	GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon	00:46.632	GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon	00:55.676	GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon	01:03.454	GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon	01:17.560	GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon	01:26.606	GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon	01:34.367	GPU Metrics [0]: Sampling buffer overflow.
Error	Daemon	01:38.231	GPU Metrics [0]: Sampling buffer overflow.
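
For anyone triaging many profiles, a small helper can tally the two error signatures quoted above. This is only a sketch: it assumes the nsys diagnostics have already been saved as plain text (how that text is exported is not covered here), and the function name and dictionary keys are hypothetical. Only the two message substrings come from the output above.

```python
from collections import Counter

# Message substrings copied from the diagnostics quoted in this issue.
# The dictionary keys are illustrative labels, not nsys terminology.
SIGNATURES = {
    "chronological_order": "GPU Metrics event chronological order was broken",
    "buffer_overflow": "Sampling buffer overflow",
}


def count_gpu_metrics_errors(diagnostics_text: str) -> Counter:
    """Count lines of the diagnostics text containing each known signature."""
    counts = Counter()
    for line in diagnostics_text.splitlines():
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts
```

A nonzero count for either label suggests the GPU metrics rows in that profile should not be trusted.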

As of cudf 23.02, we find the "GPU Metrics event chronological order was broken" error for every libcudf microbenchmark that uses nvbench, including:

'GROUPBY_NVBENCH', 
'JOIN_NVBENCH', 
'PARQUET_READER_NVBENCH',
'PARQUET_WRITER_NVBENCH', 
'REDUCTION_NVBENCH', 
'SORT_NVBENCH',
'STREAM_COMPACTION_NVBENCH'
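
To reproduce across all of the benchmarks listed above, the nsys invocation from the earlier comment can be generated per benchmark. The sketch below is a hedged illustration, not part of nvbench or libcudf: the function name and its defaults are hypothetical, the flags are copied verbatim from the command above, and the `| tee` step is omitted. The script only prints the commands, so it can be checked before anything is run.

```python
import shlex
from pathlib import PurePosixPath

# Benchmark names as listed in this issue.
NVBENCH_BENCHMARKS = [
    "GROUPBY_NVBENCH",
    "JOIN_NVBENCH",
    "PARQUET_READER_NVBENCH",
    "PARQUET_WRITER_NVBENCH",
    "REDUCTION_NVBENCH",
    "SORT_NVBENCH",
    "STREAM_COMPACTION_NVBENCH",
]


def build_nsys_command(benchmark, out_dir, nsys="nsys",
                       build_dir="cpp/build/benchmarks"):
    """Return the argument list for one nsys + nvbench run.

    The flags mirror the command quoted earlier in this thread; the
    `nsys` and `build_dir` defaults are assumptions about the local setup.
    """
    out = PurePosixPath(out_dir) / benchmark
    return [
        nsys, "profile",
        "-t", "nvtx,cuda,osrt",
        "-f", "true",
        "--cuda-memory-usage=true",
        "--gpu-metrics-device=0",
        f"--output={out}",
        str(PurePosixPath(build_dir) / benchmark),
        "--devices", "0",
        "--profile",
        "--json", f"{out}.json",
    ]


if __name__ == "__main__":
    # Dry run: print one shell-quoted command per benchmark.
    for name in NVBENCH_BENCHMARKS:
        print(shlex.join(build_nsys_command(name, "/nfs/20230113_mixed_join")))
```

Returning an argument list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting concerns.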

The error only occurs with nvbench, and never with Google Benchmark or pytest, so I still think we need an option in nvbench to prevent corrupting GPU Metrics event data. Perhaps solving #100 will also address this. I am hoping that the workaround will take the form of a "cuda safe mode" that makes nvbench act like a GPU-naive benchmarking tool.