Add workaround for corrupted nsys GPU utilization data
libcudf benchmarks run using nvbench show a conflict with Nsight Systems when collecting GPU utilization data. This issue is tracked in the Nsight Systems Jira board (Slack thread, Jira Issue).
The current consensus is that the root cause lies in how Nsight Systems collects utilization data. I'm opening this issue to request that nvbench investigate a workaround. C++ google benchmarks and python pytest benchmarks have no issues collecting GPU utilization data with Nsight Systems, so there must be some way for nvbench users running with the --profile flag to access GPU utilization data.
The corresponding JIRA issue has been closed. @GregoryKimball Can this be closed?
Thank you @bdice for checking in. I'm sorry to say that there is still a problem here at the intersection of nsys and nvbench.
When running this command on RAPIDS devel image 9ddc9c4c2046:
B=JOIN_NVBENCH && /nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --gpu-metrics-device=0 --output=/nfs/20230113_mixed_join/"$B" cpp/build/benchmarks/"$B" --devices 0 --profile --json /nfs/20230113_mixed_join/"$B".json | tee /nfs/20230113_mixed_join/"$B".txt
We receive this nsys diagnostics error instead of valid GPU Utilization metrics:
Error when processing events: Source ID=
Type=ErrorInformation (18)
Error information:
ProcessEventsError (4005)
Properties:
ErrorText (100)=/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/Analysis/EventHandler/GpuMetricsEventHandler.cpp(202): Throw in function void QuadDAnalysis::EventHandler::GpuMetricsEventHandler::PutEvent(QuadDAnalysis::EventHandler::GpuMetricsEventHandler::EventPtr)
Dynamic exception type: boost::wrapexcept
std::exception::what: ChronologicalOrderError
[QuadDCommon::tag_message*] = GPU Metrics event chronological order was broken.
...
Error Daemon 00:32.551
GPU Metrics [0]: Sampling buffer overflow.
Error Daemon 00:46.632
GPU Metrics [0]: Sampling buffer overflow.
Error Daemon 00:55.676
GPU Metrics [0]: Sampling buffer overflow.
Error Daemon 01:03.454
GPU Metrics [0]: Sampling buffer overflow.
Error Daemon 01:17.560
GPU Metrics [0]: Sampling buffer overflow.
Error Daemon 01:26.606
GPU Metrics [0]: Sampling buffer overflow.
Error Daemon 01:34.367
GPU Metrics [0]: Sampling buffer overflow.
Error Daemon 01:38.231
GPU Metrics [0]: Sampling buffer overflow.
As of cudf 23.02, we find the "GPU Metrics event chronological order was broken" error for every libcudf microbenchmark that uses nvbench, including:
'GROUPBY_NVBENCH',
'JOIN_NVBENCH',
'PARQUET_READER_NVBENCH',
'PARQUET_WRITER_NVBENCH',
'REDUCTION_NVBENCH',
'SORT_NVBENCH',
'STREAM_COMPACTION_NVBENCH'
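To check the whole list, the command above can be repeated once per benchmark. Here is a minimal dry-run sketch that only prints each command; the flags and paths are copied from the JOIN_NVBENCH command above, while the variable names and the nsys_cmd helper are illustrative and not part of nsys or nvbench:

```shell
#!/bin/sh
# Dry-run sketch: print the nsys command from this thread for each benchmark.
# NSYS, OUT_DIR, BUILD_DIR and nsys_cmd are illustrative names; the values
# are the ones used in the JOIN_NVBENCH command above.
NSYS=/nfs/nsight-systems-2022.5.1/bin/nsys
OUT_DIR=/nfs/20230113_mixed_join
BUILD_DIR=cpp/build/benchmarks
BENCHMARKS="GROUPBY_NVBENCH JOIN_NVBENCH PARQUET_READER_NVBENCH PARQUET_WRITER_NVBENCH REDUCTION_NVBENCH SORT_NVBENCH STREAM_COMPACTION_NVBENCH"

# Print the full profiling command for one benchmark name ($1).
nsys_cmd() {
  printf '%s profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --gpu-metrics-device=0 --output=%s/%s %s/%s --devices 0 --profile --json %s/%s.json | tee %s/%s.txt\n' \
    "$NSYS" "$OUT_DIR" "$1" "$BUILD_DIR" "$1" "$OUT_DIR" "$1" "$OUT_DIR" "$1"
}

# One command line per benchmark; nothing is executed here.
for b in $BENCHMARKS; do
  nsys_cmd "$b"
done
```

Piping the output of this script into sh would run the commands one after another, assuming the same image, paths, and nsys install as above.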
The error only occurs with nvbench, and never with google benchmarks or pytest, so I still think we need an option in nvbench to prevent corrupting GPU Metrics event data. Perhaps solving #100 will also address this. I am hoping that the workaround would take the form of a "cuda safe mode" that makes nvbench act like a GPU-naive benchmarking tool.
Closed by rapidsai/cudf#12728