microsoft/mscclpp

[Bug] C++ exception with description "ibv_modify_qp failed (errno 19) (Ib failure: No such device)" thrown in the test body.

FC-Li opened this issue · 9 comments

FC-Li commented

Hi, I am using mscclpp for cross-node communication, but I hit the following error. Please help me out.

C++ exception with description "ibv_modify_qp failed (errno 19) (Ib failure: No such device)" thrown in the test body.

Binyang2014 commented

Are you using IB or RoCE for cross-node communication? Can you run ibstat and post the output here?

FC-Li commented

Thank you for your swift response. Here is the ibstat output:

CA 'mlx5_0'
	CA type: MT4125
	Number of ports: 1
	Firmware version: 22.38.1900
	Hardware version: 0
	Node GUID: 0x946dae0300907832
	System image GUID: 0x946dae0300907832
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0000000000000000
		Link layer: Ethernet
CA 'mlx5_1'
	CA type: MT4125
	Number of ports: 1
	Firmware version: 22.38.1900
	Hardware version: 0
	Node GUID: 0x946dae0300907833
	System image GUID: 0x946dae0300907832
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0000000000000000
		Link layer: Ethernet
CA 'mlx5_2'
	CA type: MT4125
	Number of ports: 1
	Firmware version: 22.38.1900
	Hardware version: 0
	Node GUID: 0x946dae0300906062
	System image GUID: 0x946dae0300906062
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0000000000000000
		Link layer: Ethernet
CA 'mlx5_3'
	CA type: MT4125
	Number of ports: 1
	Firmware version: 22.38.1900
	Hardware version: 0
	Node GUID: 0x946dae0300906063
	System image GUID: 0x946dae0300906062
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0000000000000000
		Link layer: Ethernet
CA 'mlx5_4'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.38.1900
	Hardware version: 0
	Node GUID: 0xe8ebd3030058998c
	System image GUID: 0xe8ebd3030058998c
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0000000000000000
		Link layer: Ethernet
CA 'mlx5_5'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.38.1900
	Hardware version: 0
	Node GUID: 0xe8ebd3030058998d
	System image GUID: 0xe8ebd3030058998c
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0000000000000000
		Link layer: Ethernet
CA 'mlx5_6'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.38.1900
	Hardware version: 0
	Node GUID: 0xe8ebd303005899bc
	System image GUID: 0xe8ebd303005899bc
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0000000000000000
		Link layer: Ethernet
CA 'mlx5_7'
	CA type: MT4123
	Number of ports: 1
	Firmware version: 20.38.1900
	Hardware version: 0
	Node GUID: 0xe8ebd303005899bd
	System image GUID: 0xe8ebd303005899bc
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0000000000000000
		Link layer: Ethernet
FC-Li commented

@Binyang2014
ibv_rc_pingpong works fine.

So I changed the GID index to 7, but mscclpp is still not working:

set-hldy-nlp-llm-inference08:165895:165895 [0] MSCCLPP INFO TcpBootstrap : Using eth0:10.245.226.22<0>
set-hldy-nlp-llm-inference08:165895:165895 [0] MSCCLPP INFO TcpBootstrap : Using eth0:10.245.226.22<0>
set-hldy-nlp-llm-inference08:165895:165895 [0] MSCCLPP INFO rank 0 nranks 2 - connecting to 10.245.226.22<54879>
devAttr.phys_port_cnt: 1
devAttr.phys_port_cnt: 1
Port: 1
IbQpInfo: lid: 0 port: 1 linkLayer: 2 qpn: 30029 spn: 0 mtu: 3 iid: 1649149841164402688 is_grh: 1

unknown file: Failure
C++ exception with description "ibv_modify_qp failed (errno 19) (Ib failure: No such device)" thrown in the test body.

unknown file: Failure
C++ exception with description "ibv_modify_qp failed in IbQp::rtr (errno 19) (Ib failure: No such device)" thrown in the test body.
FC-Li commented

I figured it out. I had to change this line too.

Binyang2014 commented

We only run the unit tests in an InfiniBand environment, where the GID index is 0. For RoCE, it seems we need a way to discover the GID index, or let the user pass the GID index via an environment variable.

FC-Li commented

@Binyang2014 Passing the GID index via an environment variable is what NCCL does.

FC-Li commented

@Binyang2014
I wrote a kernel for both cross-node and intra-node communication.

__device__ void allGatherSendPlan(CommPlanKernelParams p) {
  int warpSize = 32;
  int warpIdx = threadIdx.x / warpSize;
  int laneIdx = threadIdx.x % warpSize;
  if (warpIdx < p.ep_world_size) {
    int remoteRank = warpIdx;
    auto smChan = p.smChans[remoteRank];
    smChan.put<4>(
        p.my_rank * p.ep_world_size * p.num_stages * sizeof(int),
        0,
        p.ep_world_size * p.num_stages * sizeof(int),
        laneIdx,
        warpSize);
    if (laneIdx == 0) {
      smChan.signal();
      smChan.wait();
    }
  }
  __syncthreads();
}

It works well when there is no cross-node communication. But if I have one GPU per node, I get "an illegal memory access was encountered".
Do put, signal, and wait work for cross-node communication?

For cross-node communication, please use the proxy channel; the SM channel is only for intra-node communication. Here is the doc for proxy channel setup: https://microsoft.github.io/mscclpp/getting-started/tutorials/initialization.html#initialize-communication-with-c-api, and sample code showing how to use the proxy channel:

__global__ void __launch_bounds__(1024) allgather0(int rank, size_t nelemsPerGPU) {
  int warpId = threadIdx.x / WARP_SIZE;
  // Each warp is responsible for one of the remote ranks.
  DeviceHandle<mscclpp::ProxyChannel> proxyChan = constProxyChans[warpId];
  // This allgather is really simple and implemented as an all-to-all.
  // This thread's role is a sender: put your data asynchronously.
  if (threadIdx.x % WARP_SIZE == 0) {
    proxyChan.putWithSignal(rank * nelemsPerGPU * sizeof(int), nelemsPerGPU * sizeof(int));
  }
  // Make sure everyone has put their data before some thread blocks everyone else in signal.
  __syncthreads();
  // Push with flag and sync to make sure the data is received.
  if (threadIdx.x % WARP_SIZE == 0) proxyChan.flush();
  // This thread's role is a receiver: wait on the semaphore to make sure the data is ready.
  if (threadIdx.x % WARP_SIZE == 0) proxyChan.wait();
}

FC-Li commented

Thank you for your help. I finally made it work. Closing this issue.