[Bug] Error when using IB to transfer large messages
Binyang2014 opened this issue · 1 comments
Binyang2014 commented
Refer to this code:
mscclpp/python/mscclpp_benchmark/allreduce.cu
Lines 618 to 663 in 391d6f8
When 6 GB of memory is registered for IB data transfer and each node uses 8 IB interfaces to transfer 128 MB of data simultaneously, the following error occurs:
mscclppd-dev-000000:28252:28457 [1] MSCCLPP INFO NUMA node of ProxyService proxy thread is set to 1
terminate called after throwing an instance of 'mscclpp::Error'
what(): src is remote, which is not supported (Mscclpp failure: InvalidUsage)
This can be reproduced on branch binyli/bug
by running the following command:
scp -r /root/mscclpp/* mscclppd-dev-000001:/root/mscclpp/>/dev/null 2>&1;mpirun --allow-run-as-root -np 16 --bind-to numa -hostfile /root/hostfile -x MSCCLPP_DEBUG=INFO -x LD_LIBRARY_PATH=/root/mscclpp/build:$LD_LIBRARY_PATH -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth0 -x NCCL_IB_PCI_RELAXED_ORDERING=1 -x NCCL_SOCKET_IFNAME=eth0 -x CUDA_DEVICE_ORDER=PCI_BUS_ID -x NCCL_NET_GDR_LEVEL=5 -x NCCL_TOPO_FILE=/root/ndv4-topo.xml -x NCCL_NET_PLUGIN=none -x NCCL_IB_DISABLE=0 -x NCCL_MIN_NCHANNELS=32 -x NCCL_DEBUG=WARN -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x MSCCLPP_HOME=/root/mscclpp -x MSCCLPP_DEBUG_SUBSYS=ALL -np 16 -npernode 8 python3 /root/mscclpp/python/mscclpp_benchmark/allreduce_bench.py
Binyang2014 commented
This is caused by the trigger's bit-width limitation: only 32-bit offsets are supported, so an offset into the 6 GB registered buffer may not fit in the trigger. This is not a bug.
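To illustrate the limit, here is a minimal sketch of a range check for a 32-bit offset. It assumes offsets are byte offsets and that "6GB" means 6 GiB. The names `fits_in_trigger` and `overflowing_chunks` are hypothetical helpers for this illustration and are not part of the mscclpp API.

```python
# Sketch only: models the 32-bit offset limit described above.
# Assumption: offsets are counted in bytes. All names here are hypothetical.

OFFSET_BITS = 32          # offset width stated in this issue
GIB = 1 << 30
MIB = 1 << 20


def fits_in_trigger(offset: int, bits: int = OFFSET_BITS) -> bool:
    """Return True if a byte offset can be represented in `bits` bits."""
    return 0 <= offset < (1 << bits)


def overflowing_chunks(buffer_size: int, chunk_size: int) -> list[int]:
    """Return the start offsets of chunks that do not fit in the offset field."""
    return [
        off
        for off in range(0, buffer_size, chunk_size)
        if not fits_in_trigger(off)
    ]


registered = 6 * GIB      # size of the registered buffer in this report
chunk = 128 * MIB         # size of each transfer in this report

bad = overflowing_chunks(registered, chunk)
if bad:
    print(f"{len(bad)} chunk offsets exceed the 32-bit range, first at {bad[0]}")
```

Under these assumptions, every chunk that starts at or beyond 4 GiB has an offset the trigger cannot encode. Keeping each registered region (or the offsets used within it) below 4 GiB would avoid the problem.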