facebookresearch/HolisticTraceAnalysis

New Trace Callstack processing out of order

briancoutinho opened this issue ยท 0 comments

๐Ÿ› Describe the bug

While doing critical path analysis- that builds a DAG of operations in the trace - I am ending up with cycles in the DAG.
This is how CPA works-

  1. Parse trace call stack.
  2. Create a node for start and node for end of every CPU operator
  3. Connect the nodes as seen in stack order

See https://hta.readthedocs.io/en/latest/source/features/lightweight_critical_path_analysis.html#cpu-operator-nesting-and-dependencies

I saw that the call stack group was providing stacks out of order. See how index 102534 comes in but it is actually older node

2024-03-14 01:11:14,541 - hta - critical_path_analysis.py:L401 - INFO - ===Exiting node TBackward0, id = 127107.0
2024-03-14 01:11:14,541 - hta - critical_path_analysis.py:L167 - INFO - Adding an edge between nodes 156119 -> 156117 type = CPEdgeType.OPERATOR_KERNEL
2024-03-14 01:11:14,542 - hta - critical_path_analysis.py:L401 - INFO - ==Exiting node autograd::engine::evaluate_function: TBackward0, id = 127106
2024-03-14 01:11:14,542 - hta - critical_path_analysis.py:L167 - INFO - Adding an edge between nodes 156117 -> 156115 type = CPEdgeType.OPERATOR_KERNEL
2024-03-14 01:11:14,542 - hta - critical_path_analysis.py:L370 - INFO - ==Entering node autograd::engine::evaluate_function: TBackward0, id = 102534
2024-03-14 01:11:14,542 - hta - critical_path_analysis.py:L167 - INFO - Adding an edge between nodes 156115 -> 156122 type = CPEdgeType.DEPENDENCY

Screenshot 2024-03-14 at 10 17 10โ€ฏAM

When we run critical path analysis it errors out due to cycles in the DAG

2024-03-13 01:20:39,328 - hta - critical_path_analysis.py:L867 - ERROR - Critical path algorithm failed due to Graph contains a cycle or graph changed during iteration

This is related to new callstack implementation from
#86

When i turned back to older call stack this does not happen, and it passes :)

For more context T182236796

Steps to reproduce

Download trace

manifold getr amoghavs/tree/overlaid_traces/fb_fm_330x/

We can run this script

from hta.configs.config import setup_logger
setup_logger()

from hta.trace_analysis import TraceAnalysis

trace_dir = "/Users/bcoutinho/Work/hta/critical_path/amogha_debug/fb_fm_330x"
analyzer = TraceAnalysis(trace_dir=trace_dir)

annotation = "ProfilerStep"
instance_id = 2
rank = 35


cp_graph, success = analyzer.critical_path_analysis(
    rank = rank, annotation=annotation, instance_id=instance_id)
print("success = ", success)

# dump overlaid trace
analyzer.overlay_critical_path_analysis(
    rank, cp_graph, only_show_critical_events=False, output_dir=trace_dir + '/overlaid', show_all_edges=True)

Expected behavior

Critical path analysis should pass.
For timebeing I am working around this by using older call stack implmentation.

Environment

OS Mac
Python 3.18
HTA version 4222b7b

Additional Info

No response