Unable to gather profile data, but timeline works - HSA_STATUS_ERROR_OUT_OF_RESOURCES
JMadgwick opened this issue · 4 comments
I receive an error when I try to profile my program. I am using the latest CodeXL release (.deb) on Ubuntu 18.04.
The program uses HC (compiled with HCC), and I have the latest ROCm installed.
In the project settings I have collect counters for HSA selected. The only indication of a problem is red text under "Performance Counters Selection", which states that the counters are not available and that they are only available with the Catalyst driver.
What does this mean? I can't use the closed source driver because (as far as I know) it doesn't support HC/HIP.
The exact error given when I try to profile is:
Unable to gather profile data. This error can occur for one of several reasons:
The active project is not an HSA program.
The active project is an HSA program, but it did not enqueue any kernels.
The active project is an HSA program, but it did not enqueue any kernels listed in the Profile Specific Kernels section.
The active project does not compile or run properly (try running it manually).
You do not have write access to the profile output directory.
None of those reasons applies. The project is an HSA program, and I can see the HSA calls in the timeline. I tried running as root to see whether write access was the problem, but the error remained.
When running in the terminal I see this:
HCC STATUS_CHECK Error: HSA_STATUS_ERROR_OUT_OF_RESOURCES (0x1008) at file:mcwamp_hsa.cpp line:1213
Failed to generate profile result /home/james/.CodeXL/CodeXL/gpubbp_ProfilerOutput/Mar-06-2019_15-24/Mar-06-2019_15-24.occupancy.
Failed to generate profile result /home/james/.CodeXL/CodeXL/gpubbp_ProfilerOutput/Mar-06-2019_15-24/Mar-06-2019_15-24.csv.
This looks like the real error, and it has nothing to do with the GUI message!
I don't know what could be causing that HCC error, but I don't think it's related to my program, because it occurs just the same if I try to profile the example HC saxpy program.
I've had a look into HSA_STATUS_ERROR_OUT_OF_RESOURCES but couldn't find much. I did find an issue where it seemed to relate to PCIe atomics. I should therefore add that I am using Vega20 (gfx906), but not a CPU that supports atomics. I am using a Xeon E3-1230v2, and the GPU is connected directly (PCIe 3.0 x16).
If this is related to the problem, then it's odd that the program runs fine but can't be profiled.
The profiler backend included in the most recent CodeXL release does not support Vega20 (gfx906). It also may not be compatible with the most recent ROCm releases. I'm not certain whether the error you're seeing is due to these facts or whether something else is going on.
I would suggest trying the version of RCP (the profiler backend) delivered with ROCm. It can be installed on a ROCm system with the following command: "sudo apt install rocm-profiler".
Once it is installed, you can run it with the following command to collect perf counters:
/opt/rocm/profiler/bin/rcprof --hsapmc <application_command_line>
If successful, you'll end up with a Session1.csv file in your home directory.
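A quick way to confirm that result is sketched here. It only assumes the default output location mentioned above (Session1.csv in the home directory); the function name check_rcprof_output is hypothetical and not part of rcprof.

```shell
#!/bin/sh
# Hypothetical helper: report whether a counter run produced a non-empty CSV.
# Argument 1 (optional): path to the CSV; defaults to $HOME/Session1.csv,
# the location described in this thread.
check_rcprof_output() {
    csv="${1:-$HOME/Session1.csv}"
    if [ -s "$csv" ]; then          # -s: file exists and is not empty
        echo "ok: $csv"
        return 0
    fi
    echo "missing or empty: $csv"
    return 1
}

# Example usage (not executed here):
#   /opt/rocm/profiler/bin/rcprof --hsapmc <application_command_line>
#   check_rcprof_output
```

The helper only checks that the file exists and has content; it does not validate the counter values inside it.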
If this works, you should be able to replace the files in the CodeXL dir with the profiler files from /opt/rocm/profiler/bin, and then profiling from the CodeXL UI should work.
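The file replacement described above could look roughly like this. It is a sketch under assumptions: the CodeXL directory depends on how CodeXL was installed, so it is passed as an argument rather than guessed, and the function name replace_profiler_files is hypothetical. It copies everything from the source directory, which may be more than is strictly needed, and it keeps a backup so the change can be undone.

```shell
#!/bin/sh
# Hypothetical helper: back up a CodeXL directory, then copy the ROCm
# profiler files over it.
# Argument 1: source directory (e.g. /opt/rocm/profiler/bin)
# Argument 2: CodeXL directory (install-specific; supplied by the user)
replace_profiler_files() {
    src="${1:-}"
    dst="${2:-}"
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "usage: replace_profiler_files <src_dir> <codexl_dir>" >&2
        return 1
    fi
    backup="${dst%/}.bak"
    if [ -e "$backup" ]; then
        echo "backup already exists: $backup" >&2
        return 1
    fi
    cp -Rp "$dst" "$backup" || return 1      # keep the original files
    cp -Rp "$src"/. "$dst"/ || return 1      # overwrite with the ROCm build
    echo "copied $src -> $dst (backup: $backup)"
}

# Example usage (not executed here; sudo may be needed for system paths):
#   replace_profiler_files /opt/rocm/profiler/bin <codexl_dir>
```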
Thanks, I have done as you said and it is now working as expected.
However it didn't work at first because I was getting this:
Insufficient privileges. Either re-run as root or modify the permissions on
/sys/class/drm/card0/device/power_dpm_force_performance_level
to give the current user write access.
Running CodeXL with sudo allows it to work, and I can now see the profile.
However, I didn't need sudo to run /opt/rocm/profiler/bin/rcprof --hsapmc directly; why is that?
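The condition named in that message (write access to the sysfs file) can be checked without sudo. The sketch below assumes the path quoted in the message as its default; the function name can_set_perf_level is hypothetical, and the path is a parameter in case the GPU is not card0.

```shell
#!/bin/sh
# Hypothetical helper: check whether the current user can write to the
# sysfs file named in the "Insufficient privileges" message.
# Argument 1 (optional): path to check; defaults to the path from the message.
can_set_perf_level() {
    f="${1:-/sys/class/drm/card0/device/power_dpm_force_performance_level}"
    if [ ! -e "$f" ]; then
        echo "not found: $f"
        return 2
    fi
    if [ -w "$f" ]; then            # -w: writable by the current user
        echo "writable: $f"
        return 0
    fi
    echo "not writable: $f"
    return 1
}

# Example usage (not executed here):
#   can_set_perf_level
```

If the file is reported as not writable, the message's own suggestion applies: either run as root or change the file's permissions. Permission changes under /sys typically do not persist across a reboot.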
I would also like to be able to view the intermediate language or ISA, but the kernel name is not blue like it is in the Quick Start Guide, and it doesn't seem to be accessible. Do I need to replace some other files to get this working too?
Also I see this in the terminal output:
[0306/172556.263680:ERROR:nss_util.cc(94)] Failed to create /home/james/.pki/nssdb directory.
Looks like this problem was pretty much because of:
the most recent CodeXL release does not support Vega20 (gfx906)
And so I presume this will not be a problem with the next version. Therefore closing.