Execution Trace Correlation Support
briancoutinho opened this issue · 0 comments
Motivation and context
Chakra Execution Trace (ET) is an open and interoperable graph-based representation of AI/ML workloads, focused on enabling and accelerating AI SW/HW co-design. Chakra execution traces represent key operations (such as compute, memory, and communication), data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools.
Correlating an Execution Trace with PyTorch timeline (Kineto) traces will produce an enriched trace data structure containing:
- Detailed operator input/output tensor information (from ET).
- Dependency edges between operators and modules (from ET).
- Timeline (start, duration) information for the PyTorch framework as well as GPU kernels (from Kineto).
This unlocks work such as critical path analysis, estimation of efficiency improvements for anti-pattern detection, and better operator input/output details.
Description
We can start by correlating the Execution Trace and the Kineto trace for a single rank.
There are two possible cases for correlation:
- The ET and the Kineto trace overlap, i.e., they were collected together. This can be handled easily using the record function ID ('rf_id') field.
- The ET and the Kineto trace come from different times. To correlate them, we need a tree correlation algorithm. A possible implementation of this already exists in param #PR79
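To make the two cases concrete, here is a minimal Python sketch. It is not the implementation from param, and everything about the data layout is an assumption made for illustration: ET nodes are shown as flat dicts with `id`, `name`, and `rf_id` keys, Kineto events as dicts whose `args` carry the ID under a key named `Record function id`, and `TreeNode`, `correlate_by_rf_id`, and `correlate_by_tree` are hypothetical names. Real traces may nest these fields differently. The tree case uses a simple greedy, in-order match of children by name, which is only a stand-in for a proper tree correlation algorithm.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def correlate_by_rf_id(et_nodes, kineto_events, kineto_key="Record function id"):
    """Case 1: traces collected together; join on the shared record function ID.

    Returns {ET node id: Kineto event}. The key names are assumptions.
    """
    events_by_rf_id = {}
    for event in kineto_events:
        rf_id = event.get("args", {}).get(kineto_key)
        if rf_id is not None:
            events_by_rf_id[rf_id] = event

    matches = {}
    for node in et_nodes:
        event = events_by_rf_id.get(node.get("rf_id"))
        if event is not None:
            matches[node["id"]] = event
    return matches


@dataclass
class TreeNode:
    """Hypothetical call-tree node, usable for either trace."""
    uid: int
    name: str
    children: list[TreeNode] = field(default_factory=list)


def correlate_by_tree(et_root, kineto_root):
    """Case 2: traces from different runs; pair nodes by walking both trees.

    Children are matched greedily by name while preserving order, so extra
    nodes on the Kineto side are skipped. Returns {ET uid: Kineto uid}.
    """
    matches = {}
    stack = [(et_root, kineto_root)]
    while stack:
        et_node, k_node = stack.pop()
        if et_node.name != k_node.name:
            continue
        matches[et_node.uid] = k_node.uid
        # A shared iterator keeps the match order-preserving.
        k_children = iter(k_node.children)
        for et_child in et_node.children:
            for k_child in k_children:
                if k_child.name == et_child.name:
                    stack.append((et_child, k_child))
                    break
    return matches
```

With either function, the resulting mapping can be used to attach the start time and duration of the Kineto event to the matching ET node, which gives the enriched structure described in the motivation.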
Setup
We propose adding param as a third-party dependency for this project. This will bring in the Execution Trace parsing data structures, among other things.
Alternatives
Additional context
No response