New Trace Callstack processing out of order
briancoutinho opened this issue · 0 comments
Describe the bug
While doing critical path analysis (CPA), which builds a DAG of the operations in the trace, I am ending up with cycles in the DAG.
This is how CPA works:
- Parse trace call stack.
- Create a node for the start and a node for the end of every CPU operator
- Connect the nodes as seen in stack order
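The steps above can be sketched roughly as follows. This is only an illustration, not HTA's actual implementation: the `Op` record, its fields, and `build_edges` are hypothetical names, and the edges are simplified to a chain of consecutive nodes in time order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Op:
    """Hypothetical CPU operator record: name, start timestamp, duration."""
    name: str
    ts: int
    dur: int


def build_edges(ops: List[Op]) -> List[Tuple[str, str]]:
    """Create a start and an end node per operator, then chain them in time order."""
    events = []
    for op in ops:
        events.append((op.ts, f"{op.name}:start"))
        events.append((op.ts + op.dur, f"{op.name}:end"))
    # Stable sort by timestamp; the sketch assumes timestamps are distinct.
    events.sort(key=lambda e: e[0])
    nodes = [name for _, name in events]
    # Connect each node to the next one seen in time order.
    return list(zip(nodes, nodes[1:]))


# A parent operator with one nested child (made-up timestamps).
ops = [Op("parent", 0, 10), Op("child", 2, 5)]
edges = build_edges(ops)
for src, dst in edges:
    print(src, "->", dst)
```

Every edge in this sketch points forward in time, which is what keeps the graph acyclic. If the traversal hands over operators in an order that does not match their timestamps, edges built from that order can point backward.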
I saw that the call stack group was providing stacks out of order. Note how index 102534 is entered after nodes 127107 and 127106 have exited, even though it is actually an older node:
2024-03-14 01:11:14,541 - hta - critical_path_analysis.py:L401 - INFO - ===Exiting node TBackward0, id = 127107.0
2024-03-14 01:11:14,541 - hta - critical_path_analysis.py:L167 - INFO - Adding an edge between nodes 156119 -> 156117 type = CPEdgeType.OPERATOR_KERNEL
2024-03-14 01:11:14,542 - hta - critical_path_analysis.py:L401 - INFO - ==Exiting node autograd::engine::evaluate_function: TBackward0, id = 127106
2024-03-14 01:11:14,542 - hta - critical_path_analysis.py:L167 - INFO - Adding an edge between nodes 156117 -> 156115 type = CPEdgeType.OPERATOR_KERNEL
2024-03-14 01:11:14,542 - hta - critical_path_analysis.py:L370 - INFO - ==Entering node autograd::engine::evaluate_function: TBackward0, id = 102534
2024-03-14 01:11:14,542 - hta - critical_path_analysis.py:L167 - INFO - Adding an edge between nodes 156115 -> 156122 type = CPEdgeType.DEPENDENCY
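One way to confirm this kind of ordering problem is to scan the events in the order they are visited and flag any whose start timestamp is earlier than one already seen. The sketch below assumes the visit order is available as `(event_id, start_timestamp)` pairs; `find_out_of_order` is a hypothetical helper and the timestamps are made up for illustration.

```python
from typing import Iterable, List, Tuple


def find_out_of_order(events: Iterable[Tuple[int, int]]) -> List[int]:
    """Return ids of events whose timestamp is earlier than one seen before."""
    latest = None
    stale = []
    for event_id, ts in events:
        if latest is not None and ts < latest:
            # This event started before an event that was visited earlier.
            stale.append(event_id)
        else:
            latest = ts
    return stale


# Ids taken from the log above; timestamps are invented for the example.
visited = [(127106, 500), (127107, 510), (102534, 120)]
stale_ids = find_out_of_order(visited)
print(stale_ids)
```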
When we run critical path analysis, it errors out due to cycles in the DAG:
2024-03-13 01:20:39,328 - hta - critical_path_analysis.py:L867 - ERROR - Critical path algorithm failed due to Graph contains a cycle or graph changed during iteration
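To illustrate why a single backward edge is enough to trigger this failure, here is a small standalone cycle check. It uses Python's standard `graphlib` module (Python 3.9+) rather than whatever HTA uses internally; `has_cycle` and the node names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Iterable, Tuple


def has_cycle(edges: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the directed edges (src, dst) contain a cycle."""
    sorter = TopologicalSorter()
    for src, dst in edges:
        # add(node, *predecessors): dst must come after src.
        sorter.add(dst, src)
    try:
        sorter.prepare()  # raises CycleError when a cycle exists
    except CycleError:
        return True
    return False


# Edges that follow time order form a DAG.
in_order = [("a", "b"), ("b", "c")]
# One edge derived from a stale, out-of-order node closes a loop.
with_stale_edge = in_order + [("c", "a")]

print(has_cycle(in_order))
print(has_cycle(with_stale_edge))
```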
This is related to the new call stack implementation from #86.
When I switched back to the older call stack implementation, this does not happen and the analysis passes :)
For more context, see T182236796.
Steps to reproduce
Download the trace:
manifold getr amoghavs/tree/overlaid_traces/fb_fm_330x/
Then run this script:
from hta.configs.config import setup_logger
setup_logger()
from hta.trace_analysis import TraceAnalysis
trace_dir = "/Users/bcoutinho/Work/hta/critical_path/amogha_debug/fb_fm_330x"
analyzer = TraceAnalysis(trace_dir=trace_dir)
annotation = "ProfilerStep"
instance_id = 2
rank = 35
cp_graph, success = analyzer.critical_path_analysis(
    rank=rank, annotation=annotation, instance_id=instance_id)
print("success = ", success)
# dump overlaid trace
analyzer.overlay_critical_path_analysis(
    rank, cp_graph, only_show_critical_events=False, output_dir=trace_dir + '/overlaid', show_all_edges=True)
Expected behavior
Critical path analysis should pass.
For the time being, I am working around this by using the older call stack implementation.
Environment
OS Mac
Python 3.18
HTA version 4222b7b
Additional Info
No response