[Bug] C++ exception with description "ibv_modify_qp failed (errno 19) (Ib failure: No such device)" thrown in the test body.
FC-Li opened this issue · 9 comments
Are you using IB or RoCE for cross node communication? Can you run ibstat
and put the output here?
Thank you for your swift response. This is what ibstat
got.
CA 'mlx5_0'
CA type: MT4125
Number of ports: 1
Firmware version: 22.38.1900
Hardware version: 0
Node GUID: 0x946dae0300907832
System image GUID: 0x946dae0300907832
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT4125
Number of ports: 1
Firmware version: 22.38.1900
Hardware version: 0
Node GUID: 0x946dae0300907833
System image GUID: 0x946dae0300907832
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_2'
CA type: MT4125
Number of ports: 1
Firmware version: 22.38.1900
Hardware version: 0
Node GUID: 0x946dae0300906062
System image GUID: 0x946dae0300906062
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_3'
CA type: MT4125
Number of ports: 1
Firmware version: 22.38.1900
Hardware version: 0
Node GUID: 0x946dae0300906063
System image GUID: 0x946dae0300906062
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_4'
CA type: MT4123
Number of ports: 1
Firmware version: 20.38.1900
Hardware version: 0
Node GUID: 0xe8ebd3030058998c
System image GUID: 0xe8ebd3030058998c
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_5'
CA type: MT4123
Number of ports: 1
Firmware version: 20.38.1900
Hardware version: 0
Node GUID: 0xe8ebd3030058998d
System image GUID: 0xe8ebd3030058998c
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_6'
CA type: MT4123
Number of ports: 1
Firmware version: 20.38.1900
Hardware version: 0
Node GUID: 0xe8ebd303005899bc
System image GUID: 0xe8ebd303005899bc
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_7'
CA type: MT4123
Number of ports: 1
Firmware version: 20.38.1900
Hardware version: 0
Node GUID: 0xe8ebd303005899bd
System image GUID: 0xe8ebd303005899bc
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
@Binyang2014
ibv_rc_pingpong is okay.
So I change gid index to 7. Mscclpp is still not working.
set-hldy-nlp-llm-inference08:165895:165895 [0] MSCCLPP INFO TcpBootstrap : Using eth0:10.245.226.22<0>
set-hldy-nlp-llm-inference08:165895:165895 [0] MSCCLPP INFO TcpBootstrap : Using eth0:10.245.226.22<0>
set-hldy-nlp-llm-inference08:165895:165895 [0] MSCCLPP INFO rank 0 nranks 2 - connecting to 10.245.226.22<54879>
devAttr.phys_port_cnt: 1
devAttr.phys_port_cnt: 1
Port: 1
IbQpInfo: lid: 0 port: 1 linkLayer: 2 qpn: 30029 spn: 0 mtu: 3 iid: 1649149841164402688 is_grh: 1
unknown file: Failure
C++ exception with description "ibv_modify_qp failed (errno 19) (Ib failure: No such device)" thrown in the test body.
unknown file: Failure
C++ exception with description "ibv_modify_qp failed in IbQp::rtr (errno 19) (Ib failure: No such device)" thrown in the test body.
We only run the unit tests in InfiniBand environment, which gid index is 0. For RoCE, seems we need a way to find the GID index, or let user pass the GID index via env variable.
@Binyang2014 Passing the GID index via env variable is what NCCL does.
@Binyang2014
I write a kernel for both cross node and within node communication.
__device__ void allGatherSendPlan(CommPlanKernelParams p) {
int warpSize = 32;
int warpIdx = threadIdx.x / warpSize;
int laneIdx = threadIdx.x % warpSize;
if (warpIdx < p.ep_world_size) {
int remoteRank = warpIdx;
auto smChan = p.smChans[remoteRank];
smChan.put<4>(
p.my_rank * p.ep_world_size * p.num_stages * sizeof(int),
0,
p.ep_world_size * p.num_stages * sizeof(int),
laneIdx,
warpSize);
if (laneIdx == 0) {
smChan.signal();
smChan.wait();
}
}
__syncthreads();
}
It works well when there are not cross node communication. But If I have one GPU per Node, I got a an illegal memory access was encountered
.
Does the put
, signal
and wait
work when doing cross node communication?
For cross node communication, pls use proxy channel, sm channel only for intra node communication. Here is the doc for proxy channel setup: https://microsoft.github.io/mscclpp/getting-started/tutorials/initialization.html#initialize-communication-with-c-api, and a sample code for how to use the proxy channel:
mscclpp/test/mscclpp-test/allgather_test.cu
Lines 30 to 50 in 3e7801b
Thank you for your help. I finally made it work. Close this issue.