facebookresearch/HolisticTraceAnalysis

Updated timeline analysis

fengxizhou opened this issue · 0 comments

🚀 Motivation and context

In trace analysis, a timeline visualizes a sequence of events occurring over a specific period of time. When analyzing the performance of ML training jobs from their execution traces, a timeline tool helps detect patterns, trends, anomalies, and interactions between different system components during model execution.

Over the past several months, we have developed a new HTA timeline tool based on the Trace DataFrame and Trace Call Stack Graph data structures. The tool provides several new features for interactive trace analysis, including:

  • Visualization of both CPU and GPU events
  • Multi-Trace Multi-Rank Trace Comparison
  • Customized Visualization for Multi-Rank Kernels
  • Alignment of NN Modules and CUDA Kernels
  • Flexible Event Filtering and Visualization

Description

We implement a set of utilities to provide flexible timeline plotting capabilities. In the implementation, we divide the timeline plotting into four steps:

  1. Prepare trace events for plotting by filtering a trace DataFrame, combining multiple DataFrames, or transforming one DataFrame into another.
  2. Automatically detect the plot settings from the trace data, such as which columns are available and which values certain columns contain.
  3. Prepare timeline events by converting the trace events into a DataFrame of timeline events.
  4. Plot the timeline to create the timeline figures.
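
To make the first three steps concrete, here is a minimal sketch using plain pandas. It is not the HTA implementation: the column names (`rank`, `name`, `stream`, `ts`, `dur`), the convention that a negative `stream` marks a CPU event, and all function names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical trace events. Column names and values are illustrative only.
trace_df = pd.DataFrame(
    {
        "rank": [0, 0, 0, 1],
        "name": ["aten::mm", "gemm_kernel", "nccl_all_reduce", "gemm_kernel"],
        "stream": [-1, 7, 9, 7],  # assumption: -1 marks a CPU event
        "ts": [100, 120, 200, 125],  # start timestamp
        "dur": [50, 30, 40, 35],  # duration
    }
)


def prepare_trace_events(df: pd.DataFrame) -> pd.DataFrame:
    """Step 1: filter the trace DataFrame (here: keep GPU events only)."""
    return df[df["stream"] >= 0].copy()


def detect_plot_setting(df: pd.DataFrame) -> dict:
    """Step 2: derive settings from the available columns and their values."""
    has_rank = "rank" in df.columns
    return {
        "multi_rank": has_rank and df["rank"].nunique() > 1,
        "streams": sorted(df["stream"].unique().tolist()),
    }


def to_timeline_events(df: pd.DataFrame, setting: dict) -> pd.DataFrame:
    """Step 3: convert trace events into rows a plotting backend can draw."""
    task = "stream " + df["stream"].astype(str)
    if setting["multi_rank"]:
        # Give each rank its own rows so kernels on different ranks do not overlap.
        task = "rank " + df["rank"].astype(str) + " / " + task
    return pd.DataFrame(
        {
            "task": task,
            "start": df["ts"],
            "end": df["ts"] + df["dur"],
            "label": df["name"],
        }
    ).reset_index(drop=True)


gpu_events = prepare_trace_events(trace_df)
setting = detect_plot_setting(gpu_events)
timeline_df = to_timeline_events(gpu_events, setting)
```

Step 4 would then pass `timeline_df` to a plotting backend. It is left out here so the sketch depends only on pandas.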

We also implement a TimelinePlotSetting class, which lets users customize how the timeline is plotted, and a Timeline class, which provides a simpler, easier-to-use API.
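
The split between a settings object and a wrapper class could look roughly like the sketch below. The names `PlotSettingSketch` and `TimelineSketch`, and every field and method on them, are hypothetical. They illustrate the design idea only and do not reflect the actual fields or methods of `TimelinePlotSetting` or `Timeline`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PlotSettingSketch:
    """Hypothetical plot settings; all fields are illustrative assumptions."""

    title: str = "Timeline"
    label_by: str = "name"  # which event field labels each bar
    max_events: Optional[int] = None  # cap on the number of events drawn


class TimelineSketch:
    """Hypothetical wrapper that pairs timeline events with their settings."""

    def __init__(self, events, setting: Optional[PlotSettingSketch] = None):
        self.events = list(events)
        self.setting = setting or PlotSettingSketch()

    def with_setting(self, **overrides) -> "TimelineSketch":
        # Return a new timeline with some settings overridden; the original is unchanged.
        return TimelineSketch(self.events, replace(self.setting, **overrides))

    def rows(self):
        # Order events by start time, apply the cap, and emit (label, start, end) tuples.
        events = sorted(self.events, key=lambda e: e["start"])
        if self.setting.max_events is not None:
            events = events[: self.setting.max_events]
        return [(e[self.setting.label_by], e["start"], e["end"]) for e in events]


events = [
    {"name": "nccl_all_reduce", "stream": 9, "start": 200, "end": 240},
    {"name": "gemm_kernel", "stream": 7, "start": 120, "end": 150},
]
timeline = TimelineSketch(events)
by_stream = timeline.with_setting(label_by="stream", max_events=1)
```

Keeping the settings immutable and separate from the events means one prepared set of events can be viewed under several configurations without re-running the preparation steps.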

Alternatives

No response

Additional context

No response