HTA expects rank to be specified in a trace file

Question

lishen opened this issue 7 months ago · 3 comments

🐛 Describe the bug

I tried to run a simple example to use HTA for the first time, but ran into an error when loading the trace. Here's the error message:

2024-02-21 12:43:20,242 - hta - trace.py:L389 - INFO - path_to/traces-1
2024-02-21 12:43:20,283 - hta - trace_file.py:L61 - ERROR - If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.
2024-02-21 12:43:20,283 - hta - trace_file.py:L92 - WARNING - There is no item in the rank to trace file map.
2024-02-21 12:43:20,284 - hta - trace.py:L535 - INFO - ranks=[]
2024-02-21 12:43:20,285 - hta - trace.py:L541 - ERROR - The list of ranks to be parsed is empty.

Steps to reproduce

First, I created a trace file:

import torch
from torch.profiler import (
    ProfilerActivity,
    profile,
    schedule,
    tensorboard_trace_handler,
)
import torchvision.models as models

# Single-process, single-GPU job: no torch.distributed setup, so no rank.
model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()

# Skip 5 steps, then idle for 5, warm up for 2, and record 2 steps, once.
tracing_schedule = schedule(skip_first=5, wait=5, warmup=2, active=2, repeat=1)
# Write gzip-compressed trace files into ./traces-1/.
trace_handler = tensorboard_trace_handler(dir_name="./traces-1/", use_gzip=True)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=tracing_schedule,
    on_trace_ready=trace_handler,
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
) as prof:
    for idx in range(25):
        model(inputs)
        prof.step()  # advance the profiler schedule by one step

Then I tried to read the trace:

from hta.trace_analysis import TraceAnalysis
trace_dir = "./traces-1/"
analyzer = TraceAnalysis(trace_dir=trace_dir)

Expected behavior

TraceAnalysis reads the trace file without error so that I can analyze it.

Environment

OS version: CentOS Linux release 7.9.2009
Python version: 3.9.1
PyTorch version: 2.2.0
torch-tb-profiler: 0.4.3
HTA version: 0.2.0
How did you install HTA (pip, source): pip

Additional Info

No response

Answer 1 · 2024-02-22T03:05:55.000Z

HTA is primarily meant for distributed jobs. In this case, it appears you are using it for a single rank. The error message clearly states what to do in this case.

2024-02-21 12:43:20,283 - hta - trace_file.py:L61 - ERROR - If the trace file does not have the rank specified in it, then add the following snippet key to the json files to use HTA; "distributedInfo": {"rank": 0}. If there are multiple traces files, then each file should have a unique rank value.

Feel free to reopen the issue if this doesn't solve the problem.
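
For anyone landing here with the same error, here is a minimal sketch of one way to apply the fix from the message above. It is not part of HTA: add_rank_to_trace and add_ranks_to_dir are hypothetical helper names, and the code assumes each trace is a JSON object (optionally gzip-compressed, as with use_gzip=True) where "distributedInfo" belongs at the top level.

```python
import gzip
import json
from pathlib import Path


def add_rank_to_trace(trace_path, rank=0):
    """Insert "distributedInfo": {"rank": rank} into one trace file if missing.

    Hypothetical helper, not an HTA API. Returns True if the file was
    rewritten, False if it already carried a rank.
    """
    path = Path(trace_path)
    # Traces written with use_gzip=True end in .gz; plain .json is read as-is.
    opener = gzip.open if path.suffix == ".gz" else open

    with opener(path, "rt", encoding="utf-8") as f:
        trace = json.load(f)  # assumed to be a JSON object, not an array

    info = trace.get("distributedInfo")
    if isinstance(info, dict) and "rank" in info:
        return False  # nothing to do

    # Keep any existing distributedInfo fields and only add the rank.
    info = dict(info) if isinstance(info, dict) else {}
    info["rank"] = rank
    trace["distributedInfo"] = info

    with opener(path, "wt", encoding="utf-8") as f:
        json.dump(trace, f)
    return True


def add_ranks_to_dir(trace_dir):
    """Assign ranks 0, 1, 2, ... to the trace files in trace_dir, in name order.

    Assumes none of the files has a rank yet; files that do are left untouched.
    """
    files = sorted(
        p for p in Path(trace_dir).iterdir()
        if p.name.endswith((".json", ".json.gz"))
    )
    return {str(p): add_rank_to_trace(p, rank=i) for i, p in enumerate(files)}
```

For the single trace in this issue, calling add_ranks_to_dir("./traces-1/") once and then re-running TraceAnalysis(trace_dir="./traces-1/") should get past the rank check. The files are rewritten in place, so keep a copy if the originals matter.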

Answer 2 · 2024-02-22T16:03:36.000Z

Yes, I realized that after I created the issue. However, it's natural for a beginner to start from a simple example. It's frustrating when even a simple example doesn't work and the documentation doesn't say anything about it. It would be nice if you could make HTA work for traces generated from non-distributed jobs as well.

Answer 3 · 2024-02-22T18:25:26.000Z

Thanks for the feedback. This is not an issue with HTA itself: the PyTorch Profiler does not include rank info in the trace when it profiles a single-rank job.
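
To confirm that this is what is happening with a given trace, a quick check along these lines may help. It is only a sketch: trace_rank is a hypothetical helper name, and it assumes the rank, when present, sits under a top-level "distributedInfo" key as in the error message.

```python
import gzip
import json
from pathlib import Path


def trace_rank(trace_path):
    """Return the rank stored in a trace file, or None if it has none.

    Hypothetical helper, not an HTA or PyTorch API.
    """
    path = Path(trace_path)
    # Handle both gzip-compressed (.gz) and plain JSON traces.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        trace = json.load(f)

    # Assumption: rank lives at trace["distributedInfo"]["rank"].
    info = trace.get("distributedInfo") if isinstance(trace, dict) else None
    return info.get("rank") if isinstance(info, dict) else None
```

A result of None for a file in ./traces-1/ would match the single-rank case described here, where the rank has to be added by hand before HTA can load the trace.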