facebookresearch/HolisticTraceAnalysis

Execution Trace Correlation Support

briancoutinho opened this issue · 0 comments

🚀 Motivation and context

Chakra Execution Trace is an open and interoperable graph-based representation of AI/ML workloads, focused on enabling and accelerating AI SW/HW co-design. Chakra execution traces (ETs) represent key operations (compute, memory, and communication), data and control dependencies, timing, and resource constraints. Chakra also includes a complementary set of tools and capabilities that enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools.

Correlating an Execution Trace with a PyTorch timeline trace (Kineto) yields an enriched trace data structure containing:

  1. Detailed operator input/output tensor information (from ET).
  2. Dependency edges between operators and modules (from ET).
  3. Timeline (start, duration) information of PyTorch framework as well as GPU kernels (from Kineto).
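
As a rough illustration of the three items above, a merged per-operator record could look like the sketch below. The class name `EnrichedOp` and all of its field names are hypothetical and chosen only for this example; they are not taken from HTA, Kineto, or the ET schema.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class EnrichedOp:
    """Hypothetical merged record for one operator (illustration only)."""

    name: str
    # From the execution trace (ET): tensor details and dependency edge.
    inputs: List[Any] = field(default_factory=list)
    outputs: List[Any] = field(default_factory=list)
    parent_id: Optional[int] = None
    # From the Kineto timeline trace: start time and duration in microseconds.
    start_us: Optional[float] = None
    duration_us: Optional[float] = None

    @property
    def end_us(self) -> Optional[float]:
        """End timestamp, or None if timing information is missing."""
        if self.start_us is None or self.duration_us is None:
            return None
        return self.start_us + self.duration_us


# Example with made-up values.
op = EnrichedOp(
    name="aten::mm",
    inputs=[[64, 128], [128, 32]],
    outputs=[[64, 32]],
    parent_id=1,
    start_us=100.0,
    duration_us=25.0,
)
untimed = EnrichedOp(name="aten::add")
```

Keeping the timing fields optional lets a record exist for ET nodes that have no matching Kineto event.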

This unlocks work such as critical path analysis, estimating efficiency improvements for detected anti-patterns, and richer operator input/output details.

Description

We can start by correlating the Execution Trace and the Kineto trace for a single rank.
There are two possible cases for correlation:

  1. The ET and Kineto trace overlap, i.e., they were collected together. This case can be handled easily using the record function ID ('rf_id') field.
  2. The ET and Kineto trace were collected at different times. Correlating them requires a tree correlation algorithm. A possible implementation already exists in param #PR79.
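
For the first case, the correlation is essentially a join on the record function ID. The sketch below is a minimal illustration, not the proposed implementation. It assumes ET nodes are dicts carrying an `rf_id` key and that Kineto events expose the same ID under `args["Record function id"]`; that key name and the helper `correlate_by_rf_id` are assumptions made for this example.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]
Event = Dict[str, Any]

# Assumed key under which a Kineto event stores the record function ID.
KINETO_RF_ID_KEY = "Record function id"


def correlate_by_rf_id(
    et_nodes: List[Node], kineto_events: List[Event]
) -> Tuple[List[Tuple[Node, Event]], List[Node]]:
    """Join ET nodes to Kineto events on rf_id.

    Returns (matched pairs, ET nodes with no matching event).
    """
    # Index Kineto events by record function ID for O(1) lookup.
    by_rf_id: Dict[Any, Event] = {}
    for ev in kineto_events:
        rf_id = ev.get("args", {}).get(KINETO_RF_ID_KEY)
        if rf_id is not None:
            by_rf_id[rf_id] = ev

    matched: List[Tuple[Node, Event]] = []
    unmatched: List[Node] = []
    for node in et_nodes:
        ev = by_rf_id.get(node.get("rf_id"))
        if ev is None:
            unmatched.append(node)
        else:
            matched.append((node, ev))
    return matched, unmatched


# Example with made-up data.
et_nodes = [
    {"id": 2, "name": "aten::mm", "rf_id": 10},
    {"id": 3, "name": "aten::add", "rf_id": 11},
    {"id": 4, "name": "aten::relu", "rf_id": 99},  # no Kineto counterpart
]
kineto_events = [
    {"name": "aten::mm", "ts": 100, "dur": 25, "args": {KINETO_RF_ID_KEY: 10}},
    {"name": "aten::add", "ts": 130, "dur": 5, "args": {KINETO_RF_ID_KEY: 11}},
    {"name": "gpu_kernel", "ts": 105, "dur": 20, "args": {}},  # no rf_id
]
matched, unmatched = correlate_by_rf_id(et_nodes, kineto_events)
```

Returning the unmatched nodes separately makes it easy to see how much of the ET falls outside the Kineto collection window.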

Setup

We propose adding param as a third-party dependency for this project. This will bring in the Execution Trace parsing data structures, etc.

Alternatives

Additional context

No response