microsoft/mscclpp

[Bug] Get error when using IB to transfer large size of message

Binyang2014 opened this issue · 1 comments

Refer to this code:

int pipelineSize = pipelineDepth;
// float nBlocksForReduceScatterRatio = 0.8;
const size_t chunkSize = nelems / worldSize;
const int peerRank = (rank + nRanksPerNode) % worldSize;
// int peerNodeId = peerRank / nRanksPerNode;
// int nBlocksForReduceScatter =
//     (int)(nBlocksForReduceScatterRatio * gridDim.x) / (nRanksPerNode - 1) * (nRanksPerNode - 1);
int isComm = (threadIdx.x == 0) && (blockIdx.x == 0);
int peer = (peerRank < rank) ? peerRank : peerRank - 1;
// int nBlocksRemain = gridDim.x - nBlocksForReduceScatter;
mscclpp::SimpleProxyChannelDeviceHandle proxyChan = proxyChans[peer];
// if (peerNodeId == rank / nRanksPerNode) {
//   localReduceScatterSm(smChans, buff, rank, nRanksPerNode, 0, 0, chunkSize, chunkSize, gridDim.x);
//   return;
// }
// step 1: local reduce
// int startChunkIndex = peerNodeId * nRanksPerNode;
// localReduceScatterSm(smChans, buff, rank, nRanksPerNode, startChunkIndex, 0, chunkSize, chunkSize / pipelineSize,
//                      nBlocksForReduceScatter);
deviceSyncer.sync(gridDim.x);
// step 2: local reduce and exchange data with neighbor
if (isComm) {
  size_t offset = (peerRank * chunkSize) * sizeof(int);
  // opposite side
  proxyChan.putWithSignal(offset, (chunkSize / pipelineSize * sizeof(int)));
}
// if (pipelineSize > 1)
//   localReduceScatterSm(smChans, buff, rank, nRanksPerNode, startChunkIndex, chunkSize / pipelineSize, chunkSize,
//                        (pipelineSize - 1) * chunkSize / pipelineSize, nBlocksForReduceScatter);
if (isComm) {
  proxyChan.wait();
}
// if (blockIdx.x >= nBlocksForReduceScatter) {
//   ibDeviceSyncer.sync(nBlocksRemain);
//   // reduce data received from peer to related rank
//   size_t offset = rank * chunkSize * sizeof(int);
//   int* dst = (int*)((char*)buff + offset);
//   int* src = (int*)((char*)scratch + offset);
//   vectorSum((TYPE*)dst, (TYPE*)src, chunkSize / pipelineSize, blockIdx.x - nBlocksForReduceScatter, nBlocksRemain);
// }
if (isComm) {
  proxyChan.flush();
}
deviceSyncer.sync(gridDim.x);

When 6 GB of memory is registered for IB data transfer and each node uses 8 IB interfaces to transfer 128 MB of data simultaneously, the following error occurs:

mscclppd-dev-000000:28252:28457 [1] MSCCLPP INFO NUMA node of ProxyService proxy thread is set to 1
terminate called after throwing an instance of 'mscclpp::Error'
  what():  src is remote, which is not supported (Mscclpp failure: InvalidUsage)

This can be reproduced on branch binyli/bug by running the following command:

scp -r /root/mscclpp/* mscclppd-dev-000001:/root/mscclpp/>/dev/null 2>&1;mpirun --allow-run-as-root -np 16 --bind-to numa   -hostfile /root/hostfile -x MSCCLPP_DEBUG=INFO -x LD_LIBRARY_PATH=/root/mscclpp/build:$LD_LIBRARY_PATH   -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_SOCKET_IFNAME=eth0   -x CUDA_DEVICE_ORDER=PCI_BUS_ID -x NCCL_NET_GDR_LEVEL=5 -x NCCL_TOPO_FILE=/root/ndv4-topo.xml   -x NCCL_NET_PLUGIN=none -x NCCL_IB_DISABLE=0 -x NCCL_MIN_NCHANNELS=32 -x NCCL_DEBUG=WARN -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0   -x MSCCLPP_HOME=/root/mscclpp -x MSCCLPP_DEBUG_SUBSYS=ALL  -np 16 -npernode 8 python3 /root/mscclpp/python/mscclpp_benchmark/allreduce_bench.py

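This is not a bug. Because of the trigger's bit-width limitation, only 32-bit offsets are supported. Byte offsets at or beyond 4 GiB, which are possible within a 6 GB registration, cannot be represented.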
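To illustrate the 32-bit offset limit, here is a minimal host-side sketch in plain C++. It does not use any mscclpp API; the names `kMaxOffset32`, `fitsIn32BitOffset`, and `truncateTo32Bits` are hypothetical helpers introduced only for this example, and the exact layout of the trigger is not assumed beyond "the offset field holds 32 bits".

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical constant: the largest byte offset a 32-bit field can hold.
constexpr uint64_t kMaxOffset32 = std::numeric_limits<uint32_t>::max();

// Hypothetical helper: true if the byte offset is representable in 32 bits.
constexpr bool fitsIn32BitOffset(uint64_t offsetBytes) {
  return offsetBytes <= kMaxOffset32;
}

// Hypothetical helper: what a 64-bit offset becomes if it is silently
// narrowed to 32 bits (the upper bits are dropped).
constexpr uint32_t truncateTo32Bits(uint64_t offsetBytes) {
  return static_cast<uint32_t>(offsetBytes);
}

// Small self-check: an offset of 5 GiB inside a 6 GiB buffer does not fit,
// and narrowing it would alias it to 1 GiB.
inline void demoOffsetLimit() {
  constexpr uint64_t GiB = 1ULL << 30;
  assert(fitsIn32BitOffset(3 * GiB));
  assert(!fitsIn32BitOffset(5 * GiB));
  assert(truncateTo32Bits(5 * GiB) == 1 * GiB);
}
```

A check like `fitsIn32BitOffset(offset)` on the host, before launching a kernel that computes `offset` from `peerRank * chunkSize * sizeof(int)`, would be one way to catch an out-of-range offset early. Whether that is the right place for such a guard in this code base is an assumption.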