facebookresearch/HolisticTraceAnalysis

CUPTI Profiler Counter analysis in HTA

briancoutinho opened this issue · 3 comments

🚀 Motivation and context

Performance counters measured on GPU kernels can provide insights into how to speed up GPU kernels, conduct roofline analysis, and perform other low-level optimizations. Profiling tools like NSight Compute provide the ability to achieve this interactively, but they do not work well for remote applications, jobs running on a cluster, etc.

The PyTorch profiler has an alternative lightweight API that uses the CUPTI Range Profiler API to program and measure detailed performance counters from the device. The underlying mechanism is similar to what NSight uses, but this solution is easier to deploy: for example, the application does not have to be launched with NSight Compute. It also supports the same list of performance metrics as NSight. Please see this PR for more details.

Performance measurements are emitted to the trace either per kernel or for the entire performance profiling region.

Description

Trace Output Walkthrough

When the CUPTI Profiler mode is enabled, the PyTorch trace will contain the performance measurement values annotated on the GPU kernel events.

  • The events are emitted under a cuda_profiler_range category.
  • The counter values are contained inside the args JSON part of the output. See the smsp__sass* entries below.
  {
    "ph": "X", "cat": "cuda_profiler_range", "name": "void at::native::vectorized_elementwise_kernel<4, at::native::(anonymous namespace)::launch_clamp_scalar(....", "pid": 0, "tid": 0,
    "ts": 1675195558492698, "dur": 115109,
    "args": {
      "smsp__sass_thread_inst_executed_op_dadd_pred_on.sum": 0, "smsp__sass_thread_inst_executed_op_dfma_pred_on.sum": 0, "smsp__sass_thread_inst_executed_op_dmul_pred_on.sum": 0, "smsp__sass_thread_inst_executed_op_hadd_pred_on.sum": 0, "smsp__sass_thread_inst_executed_op_hfma_pred_on.sum": 0, "smsp__sass_thread_inst_executed_op_hmul_pred_on.sum": 0, "smsp__sass_thread_inst_executed_op_fadd_pred_on.sum": 256, "smsp__sass_thread_inst_executed_op_ffma_pred_on.sum": 0, "smsp__sass_thread_inst_executed_op_fmul_pred_on.sum": 32
    }
  },

The CPU operators continue to be emitted as usual.
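
As a quick illustration, here is a minimal sketch (not HTA code) that pulls these counter events out of a saved Chrome trace file into a DataFrame; the file name trace.json is hypothetical and depends on your trace handler:

    import json
    import pandas as pd

    # "trace.json" is a hypothetical file name for the saved Chrome trace.
    with open("trace.json") as f:
        trace = json.load(f)

    # Keep only the CUPTI profiler events and flatten their counter values,
    # which live under each event's "args" field as shown above.
    rows = [
        {"kernel_name": ev["name"], "dur": ev["dur"], **ev["args"]}
        for ev in trace["traceEvents"]
        if ev.get("cat") == "cuda_profiler_range"
    ]
    counters = pd.DataFrame(rows)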

Correlating CPU operators with GPU kernel measurements

The goal of the analysis is to correlate the CPU operators with the GPU kernel measurements. To achieve this, we use the following strategy:

  1. Line up the cudaLaunchKernel events with the GPU kernel measurements on each CUDA stream. Typically we would use correlation_id to connect GPU kernels and CPU kernel launches, but it is not available in this mode, so we match them by launch order per stream (see the sketch after this list).
  2. We then use the operator stack feature of HTA to establish the relationship between the kernels and the CPU operators.
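
A minimal sketch of step 1, assuming hypothetical launch and kernel DataFrames that each carry ts and stream columns (the column names are illustrative, not HTA's actual schema):

    import pandas as pd

    def align_launches_with_kernels(
        launches: pd.DataFrame, kernels: pd.DataFrame
    ) -> pd.DataFrame:
        # Without correlation_id, pair the i-th launch on a stream with the
        # i-th measured kernel on that stream, relying on in-order execution
        # within a CUDA stream.
        launches = launches.sort_values("ts").copy()
        kernels = kernels.sort_values("ts").copy()
        launches["order"] = launches.groupby("stream").cumcount()
        kernels["order"] = kernels.groupby("stream").cumcount()
        return launches.merge(
            kernels, on=["stream", "order"], suffixes=("_launch", "_kernel")
        )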

Expected Output and API

Users can run a new kind of analysis, CounterAnalysis, on traces, which returns a DataFrame as follows.
Each row corresponds to one GPU kernel, with:

  • kernel_name: a column with the GPU kernel name.
  • columns corresponding to the performance counter events in the trace.
  • op_stack: an array of the correlated CPU operators.
  • bottom_level_op and top_level_op: the bottom and top of the operator stack, for convenience.

A first-cut function signature could be:

    @classmethod
    def get_counter_data_with_operators(
        cls,
        t: "Trace",
        ranks: Optional[List[int]] = None,
    ) -> List[pd.DataFrame]:
        ...

How will this be used

By managing all the results in a DataFrame, we get all the benefits of combining different performance counters.
For example, we can combine instruction counters to compute floating point operation counts (flops) per kernel:

    gpu_kernels = analyzer.get_counter_data_with_operators()[0]

    # FMA instructions count as 2 floating point operations; mul/add count as 1.
    CUDA_SASS_INSTRUCTION_COUNTER_FLOPS = {
        f"smsp__sass_thread_inst_executed_op_{op}_pred_on.sum": (2 if "fma" in op else 1)
        for op in ["ffma", "fmul", "fadd", "hfma", "hmul", "hadd", "dfma", "dmul", "dadd"]
    }

    gpu_kernels["flops"] = 0
    for counter, flops in CUDA_SASS_INSTRUCTION_COUNTER_FLOPS.items():
        gpu_kernels["flops"] += gpu_kernels[counter] * flops

Alternatives

We could implement this analysis separately, but that would mean duplicating the operator stack feature of HTA. Doing this inside HTA has the benefit of integrating seamlessly with downstream tools.

Additional context

Here is a screenshot of an example analysis in a notebook.

Thanks, @briancoutinho. This is a very useful feature.

To clarify: collecting these performance counters is independent of the PyTorch/Kineto trace, but the kernels use the same naming schema, except that there is no correlation_id field in a kernel event's args attributes.


Yep, this is independent, so it is basically a different mode in Kineto/PyTorch. You can collect this by adding the list of metrics via the experimental config:

    import torch

    # trace_handler, train_batch, and modeldef are assumed to be defined elsewhere.
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
        on_trace_ready=trace_handler,
        experimental_config=torch.profiler._ExperimentalConfig(
            profiler_metrics=[
                "kineto__tensor_core_insts",
                "dram__bytes_read.sum",
                "dram__bytes_write.sum"],
            profiler_measure_per_kernel=True),
    ) as prof:
        res = train_batch(modeldef)
        prof.step()
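
As noted in the motivation, measurements can also be aggregated over the entire profiling region rather than per kernel; a sketch of that variant, flipping the per-kernel flag:

    import torch

    # One aggregated measurement for the whole profiling region, not per kernel.
    config = torch.profiler._ExperimentalConfig(
        profiler_metrics=["dram__bytes_read.sum", "dram__bytes_write.sum"],
        profiler_measure_per_kernel=False,
    )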

This would be a great addition to HTA @briancoutinho. Looking forward to it!