StanfordLegion/legion

Hang at start-up on multi-node runs


If HTR is compiled on the latest version of master, it hangs at startup when executed on multiple nodes.
I think that the top-level task does not even start its execution.
The backtraces obtained on a two-node execution are contained in the attached files
bt_0.log
bt_1.log

The backtraces were produced on sapling, but the problem reproduces on every system that I have tried so far.

@elliottslaughter, can you please add this issue to #1032?

@artempriakhin @eddy16112 @muraj This suggests an issue with the DMA system at start-up.

Thanks! Mario, what was the latest successful commit without the hang? Was it on the control_replication branch or on the master branch after the control_replication merge? Are you running one of your standard tests?

411fb72 works for sure.
I haven't had time to bisect further between that commit and the current head of master.

It looks like it's waiting for the CUDA IPC active messages to complete. This was changed pretty recently to reduce the number of active messages sent at start-up and to improve network scalability at init time. What commit did you see the problem on? I don't see the commit in GitHub's master; on GitLab the SHA is e0fc465.
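For context, the start-up exchange described here follows a rendezvous pattern roughly like the minimal standard C++ sketch below. This is not Realm code; StartupRendezvous and its members are invented names. The point it illustrates is that each rank blocks until every expected peer response has been counted, so a response that is never delivered, or never counted, leaves start-up hung.

#include <condition_variable>
#include <mutex>
#include <thread>

// Illustrative only: a rank blocks at start-up until every peer it
// contacted has responded; one uncounted response hangs it forever.
class StartupRendezvous {
public:
  explicit StartupRendezvous(int responses_expected)
    : remaining(responses_expected) {}

  // Called from the message handler for each peer response, including
  // responses that carry nothing to import.
  void notify_response() {
    std::lock_guard<std::mutex> g(m);
    if (--remaining == 0)
      cv.notify_all();
  }

  // Called once by the rank's start-up path.
  void wait_for_all_responses() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return remaining == 0; });
  }

private:
  std::mutex m;
  std::condition_variable cv;
  int remaining;
};

int main() {
  StartupRendezvous rv(1);                        // expecting one peer response
  std::thread peer([&] { rv.notify_response(); });
  rv.wait_for_all_responses();                    // drop the notify and this never returns
  peer.join();
  return 0;
}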

I am seeing the problem on b948d941b50d2bfdd01efa4f4eed5bac41b429b4

Is there a commit that I should try?

@mariodirenzo can you try just before e0fc465?

Yeah, then it's that commit. I need logs to understand what's going on here before I can make a change. Can you give me the output with -level gpu=1 -level cudaipc=1?

This is the output on one node

[0 - 7f4051dffc80]    0.000000 {2}{gpu}: dynamically loading libnvidia-ml.so
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #0: Tesla P100-SXM2-16GB (6.0) 16276 MB
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #1: Tesla P100-SXM2-16GB (6.0) 16276 MB
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #2: Tesla P100-SXM2-16GB (6.0) 16276 MB
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #3: Tesla P100-SXM2-16GB (6.0) 16276 MB
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #0 local memory: 366080 MB/s, 13 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 0 to device 1 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 0 to device 2 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 0 to device 3 bandwidth: 50000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #1 local memory: 366080 MB/s, 13 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 1 to device 0 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 1 to device 2 bandwidth: 50000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 1 to device 3 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #2 local memory: 366080 MB/s, 13 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 2 to device 0 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 2 to device 1 bandwidth: 50000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 2 to device 3 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: GPU #3 local memory: 366080 MB/s, 13 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 3 to device 0 bandwidth: 50000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 3 to device 1 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.000000 {2}{gpu}: p2p access from device 3 to device 2 bandwidth: 25000 MB/s latency: 100 ns
[0 - 7f4051dffc80]    0.001506 {4}{threads}: reservation ('OMP1 proc 1d00000000000004 (worker 10)') cannot be satisfied
[0 - 7f4051dffc80]    0.004134 {2}{cudaipc}: Sending cuda ipc handles to 1 peers
[0 - 7f404867dc80]    0.025525 {2}{cudaipc}: Sender 1 sent nothing to import

and this is from the other

[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: dynamically loading libnvidia-ml.so
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #0: Tesla P100-SXM2-16GB (6.0) 16276 MB
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #1: Tesla P100-SXM2-16GB (6.0) 16276 MB
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #2: Tesla P100-SXM2-16GB (6.0) 16276 MB
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #3: Tesla P100-SXM2-16GB (6.0) 16276 MB
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #0 local memory: 366080 MB/s, 13 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 0 to device 1 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 0 to device 2 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 0 to device 3 bandwidth: 50000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #1 local memory: 366080 MB/s, 13 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 1 to device 0 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 1 to device 2 bandwidth: 50000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 1 to device 3 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #2 local memory: 366080 MB/s, 13 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 2 to device 0 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 2 to device 1 bandwidth: 50000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 2 to device 3 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: GPU #3 local memory: 366080 MB/s, 13 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 3 to device 0 bandwidth: 50000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 3 to device 1 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.000000 {2}{gpu}: p2p access from device 3 to device 2 bandwidth: 25000 MB/s latency: 100 ns
[1 - 7fa0321c8c80]    0.018090 {4}{threads}: reservation ('OMP1 proc 1d00010000000004 (worker 10)') cannot be satisfied
[1 - 7fa028a37c80]    0.020440 {2}{cudaipc}: Sender 0 sent nothing to import
[1 - 7fa0321c8c80]    0.025115 {2}{cudaipc}: Sending cuda ipc handles to 1 peers

Ah, that makes some sense: there's a reporting issue in the active message handler. I can fix that real quick, no worries. Thanks for reporting the issue!

@mariodirenzo Sorry, but could you give me the list of arguments you pass to Realm? I'm curious how you got into the situation where the CUDA IPC paths are enabled but no memories are sent to be imported. Just to confirm: by multi-node, you mean you're running this across two physically different systems, correct? This sounds like shared_peers is unfortunately still hitting the fallback path and collecting all the ranks as IPC-capable, which is fine; these paths are robust to that. But the following message perplexes me:

[1 - 7fa028a37c80] 0.020440 {2}{cudaipc}: Sender 0 sent nothing to import

I would have expected at least a GPU_FB_MEM to be allocated for each GPU in each rank, so there should have been at least 4 entries that attempted to import (it would first look at the hostnames, those wouldn't have checked out, and it would likely still have hung because the initialization signal was still escaped, which I have a change for).

I need to figure out how to repro your issue, as it might uncover other issues with this change that the simple fix I have won't clean up.
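To make the bug class concrete: the handler has to count even an empty response toward completion. The hypothetical sketch below (FakeIpcHandle, handle_ipc_handles, and mark_complete are made-up names, not the Realm API) shows the accounting that must also happen on the nothing-to-import path.

#include <cstdio>
#include <functional>
#include <vector>

struct FakeIpcHandle { int device; };  // stand-in for a real IPC handle payload

// Illustration only: if mark_complete() were skipped on the early-out
// path, the sending rank's start-up wait would never be satisfied.
void handle_ipc_handles(int sender,
                        const std::vector<FakeIpcHandle>& handles,
                        const std::function<void()>& mark_complete) {
  if (handles.empty()) {
    std::printf("Sender %d sent nothing to import\n", sender);
    mark_complete();  // still acknowledge the empty response
    return;
  }
  for (const FakeIpcHandle& h : handles) {
    (void)h;  // ... open/map each handle here ...
  }
  mark_complete();
}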

Sorry, but could you give me the list of arguments you give Realm?

This is the list of Realm flags that I am using
-ll:cpu 1 -ll:ocpu 2 -ll:onuma 1 -ll:othr 15 -ll:ostack 8 -ll:util 1 -ll:io 1 -ll:bgwork 1 -ll:cpu_bgwork 100 -ll:util_bgwork 100 -ll:csize 20000 -lg:eager_alloc_percentage 30 -ll:rsize 512 -ll:ib_rsize 512 -ll:gsize 0 -ll:stacksize 8 -lg:sched -1 -lg:hysteresis 0

Just to confirm, by multi-node, you mean you're running this across two physically different systems correct?

Yes, the run was executed on two different nodes of sapling2

Could you give the following branch a try?

Sure, I'll give it a go tomorrow

@muraj I can reproduce the hang on sapling, and I can confirm that your patch fixes the bug. However, I am not sure why shared_peers is not empty. We use the IPC mailbox to create shared_peers, so it should be robust on a bare-metal machine. I will need to take a look at it.

@eddy16112 My guess is the IPC mailbox path is somehow disabled in this compilation. As for the Realm flags, it looks like there's no -ll:gpu given, so no fbmems were allocated. I'll add a quick escape for that case; we really shouldn't be doing much inside the cuda module if there are no GPUs assigned.
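A guard along these lines is roughly what "a quick escape for that case" would look like. The names GpuInfo and maybe_exchange_cuda_ipc_handles are hypothetical; this is only a sketch of the idea, not the actual module code.

#include <vector>

struct GpuInfo { int index; };  // stand-in for the module's per-GPU state

// Hypothetical early-out: with no GPUs configured (no -ll:gpu), there
// are no framebuffer memories to advertise, so the IPC handle exchange
// and the wait on its responses can be skipped entirely.
void maybe_exchange_cuda_ipc_handles(const std::vector<GpuInfo>& gpus) {
  if (gpus.empty())
    return;  // nothing to export and nothing to wait for
  // ... otherwise build the handle list and send it to shared peers ...
}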

@muraj The reason shared_peers is not empty is that shared memory is not enabled, so we fall back to relying on the network module to report the shared_peers. GASNetEX reports an empty shared_peers, which is correct. However, due to the logic here https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/gasnetex/gasnetex_internal.cc?ref_type=heads#L3437, we do not know whether an empty shared_peers means GASNetEX cannot detect it or there are indeed no shared peers, so we set shared_peers to all_peers.
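The fallback being described boils down to something like the sketch below (illustrative only; resolve_shared_peers is a made-up name): an empty set from the network layer is ambiguous, so the runtime conservatively treats every rank as a potential shared-memory peer.

#include <set>

// Sketch of the ambiguity: an empty report could mean "no shared peers"
// or "the network module could not detect them", so fall back to all_peers.
std::set<int> resolve_shared_peers(const std::set<int>& reported_by_network,
                                   const std::set<int>& all_peers) {
  if (reported_by_network.empty())
    return all_peers;  // conservative: assume everyone might share memory
  return reported_by_network;
}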

yup, that's expected.

cudaipc-fix fixes the issue. Thanks for working on it

Yeah, should be okay now.

@mariodirenzo Go ahead and close this when you're ready.

Thanks again for fixing the issue