NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
NHZlX opened this issue · 13 comments
OpenMPI 1.8.5
NCCL 2.8.3
CUDA 10.2
MLNX_OFED_LINUX-5.1-2.5.8.0
ibv_devinfo:
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 16.28.2006
node_guid: 0c42:a103:0023:ac92
sys_image_guid: 0c42:a103:0023:ac92
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
This issue looks like https://github.com/NVIDIA/nccl/issues/214, but I have verified that ACS is not enabled on either of the nodes.
The following are the command and the error log:
mpirun --allow-run-as-root -np 2 --mca btl tcp,self --mca btl_tcp_if_exclude eth0 -host , -x CUDA_VISIBLE_DEVICES="0,2" -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_bond_0:1 -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 /home/mingkun/enter/test/nccl-tests/build/all_reduce_perf -b 9 -e 128M -f 2 -g 1 -z 0
# nThread 1 nGpus 1 minBytes 9 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 20109 on machine-17 device 0 [0x1a] Tesla V100-SXM2-32GB
# Rank 1 Pid 70497 on machine-19 device 0 [0x1a] Tesla V100-SXM2-32GB
machine-17:20109:20109 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-17:20109:20109 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-17:20109:20109 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO Using network IB
machine-19:70497:70497 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-19:70497:70497 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-19:70497:70497 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO Using network IB
NCCL version 2.8.3+cuda10.2
machine-19:70497:70509 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
machine-19:70497:70509 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-17:20109:20124 [0] NCCL INFO Channel 00/02 : 0 1
machine-17:20109:20124 [0] NCCL INFO Channel 01/02 : 0 1
machine-17:20109:20124 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
machine-17:20109:20124 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-19:70497:70509 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-17:20109:20124 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Connected all rings
machine-17:20109:20124 [0] NCCL INFO Connected all trees
machine-17:20109:20124 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-17:20109:20124 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-19:70497:70509 [0] NCCL INFO Connected all rings
machine-19:70497:70509 [0] NCCL INFO Connected all trees
machine-19:70497:70509 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-19:70497:70509 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-17:20109:20124 [0] NCCL INFO comm 0x468ee30 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
machine-19:70497:70509 [0] NCCL INFO comm 0x3cfcf70 rank 1 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
machine-17:20109:20109 [0] NCCL INFO Launch mode Parallel
machine-19:70497:70516 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-19:70497:70516 [0] NCCL INFO include/net.h:28 -> 2
machine-19:70497:70516 [0] NCCL INFO transport/net.cc:404 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:320 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]
machine-17:20109:20140 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-17:20109:20140 [0] NCCL INFO include/net.h:28 -> 2
machine-17:20109:20140 [0] NCCL INFO transport/net.cc:404 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:320 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]
machine-19: Test NCCL failure common.cu:346 'unhandled system error'
.. machine-19: Test failure common.cu:395
.. machine-19: Test failure common.cu:494
.. machine-19: Test failure all_reduce.cu:103
.. machine-19: Test failure common.cu:520
.. machine-19: Test failure common.cu:844
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[51283,1],1]
Exit code: 3
-worker-1:946:988 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
Same problem here.
Solved it by setting -x NCCL_IB_GID_INDEX=3.
Sorry for having missed this. An error 12 is a timeout. When it happens right away, it usually means the NICs can't talk to each other using RoCE. The connection can be established because it's not done using RoCE, but then as soon as we start communicating through RoCE we get a timeout.
How to solve that is unfortunately often vendor dependent. Switches can filter packets or fail to route them. I'd suggest running low-level RoCE tests first (OFED perftest) and making sure NCCL runs under the same conditions (GID index, traffic class, ...).
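For example, a minimal perftest check that pins the device and GID index might look like the following (the device name and IP are taken from the logs above; adjust both, and the GID index, to your setup):

# on machine-17 (server side)
ib_write_bw -d mlx5_bond_0 -x 3
# on machine-19 (client side), pointing at machine-17
ib_write_bw -d mlx5_bond_0 -x 3 10.11.170.41

If this already fails or hangs, the problem is in the fabric configuration rather than in NCCL.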
@NHZlX Hi, what is the output of the command show_gids in your env?
The NCCL_IB_GID_INDEX variable defines the Global ID index used in RoCE mode; see the InfiniBand show_gids command in order to set this value.
NCCL_IB_GID_INDEX=3 solved my issue.
My show_gids output:
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_2 1 0 fe80:0000:0000:0000:5054:00ff:fec2:7a7a v1 eth1
mlx5_2 1 1 fe80:0000:0000:0000:5054:00ff:fec2:7a7a v2 eth1
mlx5_2 1 2 0000:0000:0000:0000:0000:ffff:0bd8:591d 11.216.89.29 v1 eth1
mlx5_2 1 3 0000:0000:0000:0000:0000:ffff:0bd8:591d 11.216.89.29 v2 eth1
n_gids_found=4
Would you mind giving some explanation about NCCL_IB_GID_INDEX? Thanks.
The GID index basically determines how IB packets are encapsulated over Ethernet or IP (v4 or v6). I'm no expert, but I think here GID 0 and 1 would use Ethernet, GID 2 and 3 would use IPv4, and if you had an IPv6 address configured on the interface you would have two more GID indexes. For each, you can choose between RoCEv1 and RoCEv2, which have different encapsulations and capabilities.
So choosing the GID index is key to whether packets can be routed through the fabric and to how QoS policies are applied to them, but all of this obviously depends on how the fabric is configured, which is outside the reach of NCCL.
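In practice that means matching NCCL's settings to whatever the fabric expects. As an illustrative sketch (the traffic-class value here is purely an example, not a recommendation):

export NCCL_IB_GID_INDEX=3   # RoCEv2 over IPv4, per the show_gids output above
export NCCL_IB_TC=106        # optional: traffic class, if the fabric enforces one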
I set NCCL_IB_GID_INDEX=3 but still get this error.
@NHZlX How did you solve this problem?
There is no universal solution to this. Error 12 in IB terms is the same as "No route to host" with sockets.
It could be that you're not using the right interface (NCCL_IB_GID_INDEX), that your IP addressing is wrong, that the switch is down, or that the nodes are in different networks with no routing between them... it can be a lot of things. Basically it just says that two NICs could not talk to each other.
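A quick way to test raw RDMA connectivity outside of NCCL is rping from librdmacm-utils (the IP below is a placeholder for the server node's address):

# on one node, start the server
rping -s -v
# on the other node, connect to it
rping -c -a 10.11.170.41 -v

If rping fails too, the issue is in the RoCE/network setup rather than in NCCL.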
export NCCL_IB_GID_INDEX=3 solved my problem. Thanks very much.
It worked for me! Thanks!
I'm trying to use DeepSpeed for DDP, but this problem occurs. I added the line NCCL_IB_GID_INDEX=3 to .deepspeed_env and the problem was solved!
What magic! Can anyone explain a little bit here?
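For context, .deepspeed_env is just a plain file of NAME=value pairs that DeepSpeed exports to every launched worker, so a minimal version would be (assuming the default location in your home directory):

# ~/.deepspeed_env — one VAR=value per line, propagated to all workers
NCCL_IB_GID_INDEX=3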
With recent NCCL versions you should no longer need to set NCCL_IB_GID_INDEX=3, and doing so can actually work less well if the GID changes. So I would advise upgrading NCCL and removing that environment variable from your scripts in the future.