StanfordLegion/legion

Realm Support for GPU Kernel Profiling

Closed this issue · 3 comments

Today Realm provides a profiling measurement for bounding the activity of all asynchronous work that a GPU task performs on a GPU.

https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/profiling.h?ref_type=heads#L127-147

While this is useful for bounding how long work took on the GPU, it is not precise. A GPU task might launch multiple kernels or other asynchronous operations (e.g. memcpy), and the GPU driver might interleave them with kernels from other GPU tasks, so the GPU isn't busy running kernels from just one GPU task at a time.

It would be good if Realm could provide profiling feedback about the individual kernels and other asynchronous operations performed inside a GPU task, including when they actually ran on the GPU, so that we can accurately represent that information to mappers and to the Legion profiler.

It seems like this might be possible with the CUPTI interface in CUDA, but it's unclear what overheads it would incur. It also appears to be a global setting, so you might have to pay for it all the time even if many GPU tasks don't request this kind of kernel profiling measurement from Realm. Some exploration should be done to determine whether this is even a reasonable path before embarking on it.
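To make the imprecision concrete, here is a minimal C++ sketch of the difference between one bounding interval and per-operation intervals. It does not use any Realm or CUPTI API: `GpuInterval`, `bounding_box`, and `busy_time` are hypothetical names introduced only for illustration, and timestamps are assumed to be nanoseconds on a common clock.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record for one asynchronous GPU operation (kernel or memcpy)
// launched by a task. Not a Realm type; for illustration only.
struct GpuInterval {
  int64_t start_ns;
  int64_t end_ns;
};

// The single "box" that a bounding measurement reports:
// earliest start to latest end across all of the task's operations.
GpuInterval bounding_box(const std::vector<GpuInterval>& ops) {
  assert(!ops.empty());
  GpuInterval box = ops.front();
  for (const GpuInterval& op : ops) {
    box.start_ns = std::min(box.start_ns, op.start_ns);
    box.end_ns = std::max(box.end_ns, op.end_ns);
  }
  return box;
}

// Time during which at least one of the task's operations was running:
// the length of the union of the per-operation intervals.
int64_t busy_time(std::vector<GpuInterval> ops) {
  if (ops.empty()) return 0;
  std::sort(ops.begin(), ops.end(),
            [](const GpuInterval& a, const GpuInterval& b) {
              return a.start_ns < b.start_ns;
            });
  int64_t total = 0;
  int64_t cur_start = ops[0].start_ns;
  int64_t cur_end = ops[0].end_ns;
  for (size_t i = 1; i < ops.size(); ++i) {
    if (ops[i].start_ns <= cur_end) {
      // Overlaps (or touches) the current run: extend it.
      cur_end = std::max(cur_end, ops[i].end_ns);
    } else {
      // Gap: close the current run and start a new one.
      total += cur_end - cur_start;
      cur_start = ops[i].start_ns;
      cur_end = ops[i].end_ns;
    }
  }
  total += cur_end - cur_start;
  return total;
}
```

With made-up intervals such as `{0, 10}`, `{30, 40}`, and `{60, 70}` for one task, `bounding_box` spans 70 ns while `busy_time` is only 30 ns. The gaps are where kernels from other GPU tasks could have been running, which a single bounding measurement cannot show.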

Assigning @apryakhin to triage for now. This is a low-priority enhancement.

muraj commented

Closing this issue as a duplicate of #1732 which has more traction.

I'm not sure that #1732 actually supersedes this issue?

My understanding is that #1732 is about making the bounding box around the GPU kernels of a task tighter. But you still fundamentally get one box.

This issue is about accurately representing multiple boxes, one per kernel. Obviously this is of no use if it's not at least as precise as #1732, but fundamentally it's a different problem to solve. And (I suspect) it's an open question whether we want to solve it, because it could potentially have dramatically higher overheads.

I have mixed feelings. On one hand, I don't think Realm should be in the business of duplicating the functionality of Nsight, but at the same time, I think there might be some value in at least getting individual kernel timings into the Legion Prof profile. I'd be inclined to reopen this issue and just let it remain open for a while in case any important use cases pop up.