facebookresearch/HolisticTraceAnalysis

Critical path analysis - matching the kernel related to cudaEventRecord with stream

briancoutinho opened this issue · 3 comments

🐛 Describe the bug

cudaEventRecord(cudaEvent_t event, cudaStream_t stream)) will record the event on the specified stream, it’s not necessary that the preceding kernel was scheduled to the same stream. For example, the latest preceding kernel was scheduled to stream 3 but the following cudaEventRecord was to record the event on stream 1, then the right kernel is to be dependent on (by the next kernel on stream2) should be the last kernel on stream 1, rather than stream 3.

We need to update logic in HTA to not only find previous launch but check its cuda stream too. https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/hta/analyzers/critical_path_analysis.py#L436-L439

Steps to reproduce

We can use a small test program like this

with torch.cuda.stream(stream1): 
     result = torch.matmul(matrix_a, matrix_b)

with torch.cuda.stream(stream3): 
    result = torch.matmul(matrix_a, matrix_b)

event.record(stream1)
event.wait(stream2)

with torch.cuda.stream(stream2): 
     result1 = torch.matmul(matrix_c, result)

Expected behavior

Event record should be associated with stream1 matmul.

Environment

N/A

Additional Info

No response

@briancoutinho thanks for raising this issue.

When I was looking into the cuda_record_calls df for our event_sync_trace.json.gz , it is evident that even the full trace df does not capture the stream with which this event is recorded.

index cat name pid tid ts dur stream correlation bytes memory_bw_gbps wait_on_stream wait_on_cuda_event_record_corr_id external_id index_correlation iteration end
31 31 2 11 948300 948300 3017 3 -1 1522 -1 -1 -1 -1 1522 0 100 1.70742e+15
37 37 2 11 948300 948300 3041 3 -1 1533 -1 -1 -1 -1 1533 0 100 1.70742e+15

Also, the raw trace json does not seem to have any useful information about the stream on which this event is called

   {
      "ph": "X",
      "cat": "cuda_runtime",
      "name": "cudaEventRecord",
      "pid": 948300,
      "tid": 948300,
      "ts": 1707417525512352,
      "dur": 3,
      "args": {
        "External id": 1522,
        "cbid": 135,
        "correlation": 1522
      }
    },

Additionally, it seems like CUPI Activity API (which is used by kineto method) does not support/capture such information. Maybe I am missing something?

@kvignesh1420 yea this is a bit complicated because CUPTI is really not telling which stream the event is recorded in the CPU event. It does however tell us on the GPU side but that event does not have timing. I'll post a comment showing output that will make this clear :)
So one way we can figure out the stream an event is recorded on is by looking at future Sync events, these events have both cudaEventRecord() correlation ID, and the stream.

See "wait_on_cuda_event_record_corr_id" and "wait_on_stream"
Screenshot 2024-04-11 at 2 09 59 PM

@kvignesh1420 @itsrakeshk , @gajjanag
Here is a simple reproduction of the issue = full code
An outline of what the code does

# run kernel 1 on stream1 (stream = 20)
with torch.cuda.stream(stream1): 
    result = torch.matmul(matrix_a, matrix_b)

# run kernel 2 on stream3 (stream = 28)
with torch.cuda.stream(stream3): 
   result = torch.matmul(matrix_a, matrix_b)

# record an event on stream1 / stream = 20.
event.record(stream1)
# make kernels on stream 2 wait on the above event.
event.wait(stream2)

# run a kernel3  on stream 2
with torch.cuda.stream(stream2): 
    result1 = torch.matmul(matrix_c, result)

I enabled some prints in Kineto/PyTorch profiler so we can see this in action (full log here, I removed the eventid=9 as it is not relevant)

# kernel 1 on stream 20
INFO:2024-04-11 13:30:02 3727853:3727853 CuptiActivityProfiler.cpp:665] 25: Memset (Device)
INFO:2024-04-11 13:30:02 3727853:3727853 CuptiActivityProfiler.cpp:665] 27: ampere_sgemm_128x64_nn

# kernel 2 on stream 28
INFO:2024-04-11 13:30:02 3727853:3727853 CuptiActivityProfiler.cpp:665] 55: Memset (Device)
INFO:2024-04-11 13:30:02 3727853:3727853 CuptiActivityProfiler.cpp:665] 57: ampere_sgemm_128x64_nn

# The event id = 19 is recored on stream 20
INFO:2024-04-11 13:30:02 3727853:3727853 CuptiActivityProfiler.cpp:526] : CUPTI_ACTIVITY_KIND_CUDA_EVENT corrId=1385 eventId=19 streamId=20 contextId=1

# The stream Wait event on  stream 24 
INFO:2024-04-11 13:30:02 3727853:3727853 CuptiActivityProfiler.cpp:540] : CUPTI_ACTIVITY_KIND_SYNCHRONIZATION type=Stream Wait Event corrId=1389 streamId=24 eventId=19 contextId=1

# kernel 3 on stream 25
INFO:2024-04-11 13:30:02 3727853:3727853 CuptiActivityProfiler.cpp:665] 1411: Memset (Device)
INFO:2024-04-11 13:30:02 3727853:3727853 CuptiActivityProfiler.cpp:665] 1413: ampere_sgemm_128x64_nn

When we run the HTA Critical path analysis since we only look for the previous kernel launch before event record (after kernel 2 is launched) we are incorrectly adding dependency between the kernel 2 -> kernel 3

cp_before_fix