CUPTI Profiler Counter analysis in HTA
briancoutinho opened this issue · 3 comments
Motivation and context
Performance counters measured on GPU kernels can provide insights into how to speed up kernels, conduct roofline analysis, and perform other low-level optimizations. Profiling tools like NSight Compute make this possible interactively, but they do not work well on remote applications, jobs running on a cluster, etc.
PyTorch profiler has an alternative lightweight API that uses the CUPTI Range Profiler API to program and measure detailed performance counters from the device. The underlying mechanism is similar to what NSight uses, but this solution is easier to deploy; for example, the application does not have to be launched with NSight Compute. It also supports the same list of performance metrics as NSight. Please see this PR for more details.
Performance measurements are emitted to the trace either per kernel or for the entire performance profiling region.
Description
Trace Output Walkthrough
When the CUPTI Profiler mode is enabled the PyTorch trace will contain the performance measurement values annotated in the GPU kernel events.
- The events are emitted under a `cuda_profiler_range` category.
- The counter values are contained inside the `args` JSON part of the output. See the `smsp__sass*` entries below.
```json
{
  "ph": "X", "cat": "cuda_profiler_range",
  "name": "void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::launch_clamp_scalar(....",
  "pid": 0, "tid": 0,
  "ts": 1675195558492698, "dur": 115109,
  "args": {
    "smsp__sass_thread_inst_executed_op_dadd_pred_on.sum": 0,
    "smsp__sass_thread_inst_executed_op_dfma_pred_on.sum": 0,
    "smsp__sass_thread_inst_executed_op_dmul_pred_on.sum": 0,
    "smsp__sass_thread_inst_executed_op_hadd_pred_on.sum": 0,
    "smsp__sass_thread_inst_executed_op_hfma_pred_on.sum": 0,
    "smsp__sass_thread_inst_executed_op_hmul_pred_on.sum": 0,
    "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum": 256,
    "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum": 0,
    "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum": 32
  }
},
```
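Events of this shape can be pulled out of the trace file with plain JSON handling. A minimal sketch, assuming a trace dict already loaded from a Chrome-trace-format file (the single event here is the one shown above, truncated):

```python
# One CUPTI-profiler event from a PyTorch/Kineto trace (Chrome trace format).
# In practice this dict would come from json.load() on the trace file.
trace = {
    "traceEvents": [
        {
            "ph": "X", "cat": "cuda_profiler_range",
            "name": "void at::native::vectorized_elementwise_kernel<4, ...>",
            "pid": 0, "tid": 0, "ts": 1675195558492698, "dur": 115109,
            "args": {
                "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum": 256,
                "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum": 32,
            },
        }
    ]
}

# Keep only the CUPTI profiler events; counter values live in each event's "args".
counter_events = [
    e for e in trace["traceEvents"] if e.get("cat") == "cuda_profiler_range"
]

# Extract just the performance-counter entries for the first kernel.
counters = {
    k: v for k, v in counter_events[0]["args"].items() if k.startswith("smsp__")
}
```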
The CPU operators continue to be emitted as usual.
Correlating CPU operators with GPU kernel measurements
The goal of the analysis is to correlate the CPU operators with the GPU kernel measurements. To achieve this we use the following strategy:
- Line up the `cudaKernelLaunch` events with GPU kernel measurements on each CUDA stream. Typically we would use `correlation_id` to connect GPU kernels and CPU kernel launches, but it is not available in this mode.
- We then use the operator stack feature of HTA to tie up the relationship between the kernels and the CPU operators.
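Since `correlation_id` is absent, the first step above amounts to pairing launches and kernels by their order of occurrence on each stream. A sketch of that idea, with simplified, assumed event shapes (`ts` and `stream` fields only):

```python
from collections import defaultdict

def align_by_stream(launches, kernels):
    """Pair cudaKernelLaunch events with GPU kernel measurements by their
    order of occurrence on each CUDA stream (no correlation_id available)."""
    # Bucket GPU kernels per stream, in timestamp order.
    by_stream = defaultdict(list)
    for k in sorted(kernels, key=lambda e: e["ts"]):
        by_stream[k["stream"]].append(k)

    # Consume launches in timestamp order; each one takes the next
    # still-unmatched kernel queued on its target stream.
    cursor = defaultdict(int)
    pairs = []
    for launch in sorted(launches, key=lambda e: e["ts"]):
        s = launch["stream"]
        if cursor[s] < len(by_stream[s]):
            pairs.append((launch, by_stream[s][cursor[s]]))
            cursor[s] += 1
    return pairs
```

This relies on kernels executing on a stream in the order they were launched, which is what CUDA streams guarantee.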
Expected Output and API
Users can run a new kind of analysis, `CounterAnalysis`, on traces that returns a dataframe as follows. Each row corresponds to one GPU kernel, with:
- a `kernel_name` column for the GPU kernel;
- columns corresponding to the performance counter events in the trace;
- `op_stack`, an array of the correlated CPU operators; `bottom_level_op` and `top_level_op` pick the bottom and top of the operator stack for convenience.
A first-cut function signature could be:
```python
def get_counter_data_with_operators(
    cls,
    t: "Trace",
    ranks: Optional[List[int]] = None,
) -> List[pd.DataFrame]:
    ...
```
How will this be used
By keeping all the results in a dataframe we get all the benefits of combining different performance counters.
For example, we can combine instruction counters to compute total floating point operations (FLOPs):
```python
gpu_kernels = analyzer.get_counter_data_with_operators()[0]

# FMA instructions count as 2 FLOPs; all other float instructions count as 1.
CUDA_SASS_INSTRUCTION_COUNTER_FLOPS = {
    f"smsp__sass_thread_inst_executed_op_{op}_pred_on.sum": (2 if "fma" in op else 1)
    for op in ["ffma", "fmul", "fadd", "hfma", "hmul", "hadd", "dfma", "dmul", "dadd"]
}

gpu_kernels["flops"] = 0
for counter, flops in CUDA_SASS_INSTRUCTION_COUNTER_FLOPS.items():
    gpu_kernels["flops"] += gpu_kernels[counter] * flops
```
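Because the counters and kernel timings live in the same dataframe, derived rates reduce to column arithmetic. A sketch, assuming the kernel duration is carried into the dataframe as a `dur` column in microseconds (as in the trace event's `dur` field); the two rows here are made-up illustrative values:

```python
import pandas as pd

# Hypothetical result: total FLOPs (as computed above) plus kernel
# duration in microseconds taken from the trace's "dur" field.
gpu_kernels = pd.DataFrame(
    {
        "kernel_name": ["elementwise_kernel", "gemm_kernel"],
        "flops": [288, 1_000_000],
        "dur": [115109, 2000],
    }
)

# Achieved throughput: FLOPs divided by seconds (dur is in microseconds).
gpu_kernels["flops_per_sec"] = gpu_kernels["flops"] / (gpu_kernels["dur"] * 1e-6)
```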
Alternatives
We could implement this analysis separately, but that would mean duplicating the operator stack feature of HTA. Doing it inside HTA has the benefit of integrating seamlessly with downstream tools.
Additional context
Thanks, @briancoutinho. This is a very useful feature.
To clarify, collecting these performance counters is independent from the PyTorch/Kineto trace but the kernels use the same naming schema except that there is no correlation_id field in a kernel event's args attributes.
Yep, this is independent; it is basically a different mode in Kineto/PyTorch. You can collect this by adding the list of metrics via the experimental config:
```python
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
    on_trace_ready=trace_handler,
    experimental_config=torch.profiler._ExperimentalConfig(
        profiler_metrics=[
            "kineto__tensor_core_insts",
            "dram__bytes_read.sum",
            "dram__bytes_write.sum",
        ],
        profiler_measure_per_kernel=True,
    ),
) as prof:
    res = train_batch(modeldef)
    prof.step()
```
This would be a great addition to HTA @briancoutinho. Looking forward to it!