NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
NHZlX opened this issue · 13 comments
OpenMPI 1.8.5
NCCL 2.8.3
CUDA 10.2
MLNX_OFED_LINUX-5.1-2.5.8.0
ibv_devinfo:
hca_id: mlx5_bond_0
transport: InfiniBand (0)
fw_ver: 16.28.2006
node_guid: 0c42:a103:0023:ac92
sys_image_guid: 0c42:a103:0023:ac92
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
This issue looks like https://github.com/NVIDIA/nccl/issues/214, but I have verified that ACS is not enabled on either of the nodes.
The following are the command and the error log:
mpirun --allow-run-as-root -np 2 --mca btl tcp,self --mca btl_tcp_if_exclude eth0 -host , -x CUDA_VISIBLE_DEVICES="0,2" -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_bond_0:1 -x NCCL_P2P_DISABLE=0 -x NCCL_SHM_DISABLE=0 -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=0 /home/mingkun/enter/test/nccl-tests/build/all_reduce_perf -b 9 -e 128M -f 2 -g 1 -z 0
# nThread 1 nGpus 1 minBytes 9 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 20109 on machine-17 device 0 [0x1a] Tesla V100-SXM2-32GB
# Rank 1 Pid 70497 on machine-19 device 0 [0x1a] Tesla V100-SXM2-32GB
machine-17:20109:20109 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-17:20109:20109 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-17:20109:20109 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.41<0>
machine-17:20109:20109 [0] NCCL INFO Using network IB
machine-19:70497:70497 [0] NCCL INFO Bootstrap : Using eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
machine-19:70497:70497 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
machine-19:70497:70497 [0] NCCL INFO NET/IB : Using [0]mlx5_bond_0:1/RoCE ; OOB eth0:10.11.170.43<0>
machine-19:70497:70497 [0] NCCL INFO Using network IB
NCCL version 2.8.3+cuda10.2
machine-19:70497:70509 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
machine-19:70497:70509 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-17:20109:20124 [0] NCCL INFO Channel 00/02 : 0 1
machine-17:20109:20124 [0] NCCL INFO Channel 01/02 : 0 1
machine-17:20109:20124 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
machine-17:20109:20124 [0] NCCL INFO Setting affinity for GPU 0 to 010000,00000001
machine-19:70497:70509 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-17:20109:20124 [0] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 00 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [receive] via NET/IB/0
machine-19:70497:70509 [0] NCCL INFO Channel 01 : 1[1a000] -> 0[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1a000] [send] via NET/IB/0
machine-17:20109:20124 [0] NCCL INFO Connected all rings
machine-17:20109:20124 [0] NCCL INFO Connected all trees
machine-17:20109:20124 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-17:20109:20124 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-19:70497:70509 [0] NCCL INFO Connected all rings
machine-19:70497:70509 [0] NCCL INFO Connected all trees
machine-19:70497:70509 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
machine-19:70497:70509 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
machine-17:20109:20124 [0] NCCL INFO comm 0x468ee30 rank 0 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
machine-19:70497:70509 [0] NCCL INFO comm 0x3cfcf70 rank 1 nranks 2 cudaDev 0 busId 1a000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
machine-17:20109:20109 [0] NCCL INFO Launch mode Parallel
machine-19:70497:70516 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-19:70497:70516 [0] NCCL INFO include/net.h:28 -> 2
machine-19:70497:70516 [0] NCCL INFO transport/net.cc:404 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:320 -> 2
machine-19:70497:70516 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]
machine-17:20109:20140 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
machine-17:20109:20140 [0] NCCL INFO include/net.h:28 -> 2
machine-17:20109:20140 [0] NCCL INFO transport/net.cc:404 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:320 -> 2
machine-17:20109:20140 [0] NCCL INFO proxy.cc:367 -> 2 [Proxy Thread]
machine-19: Test NCCL failure common.cu:346 'unhandled system error'
.. machine-19: Test failure common.cu:395
.. machine-19: Test failure common.cu:494
.. machine-19: Test failure all_reduce.cu:103
.. machine-19: Test failure common.cu:520
.. machine-19: Test failure common.cu:844
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[51283,1],1]
Exit code: 3
-worker-1:946:988 [0] transport/net_ib.cc:839 NCCL WARN NET/IB : Got completion with error 12, opcode 1, len 0, vendor err 129
Same problem here.
Solved it by setting -x NCCL_IB_GID_INDEX=3.
Sorry for having missed this. An error 12 is a timeout. When it happens right away, it usually means the NICs can't talk to each other using RoCE. The connection can be established because it's not done using RoCE, but then as soon as we start communicating through RoCE we get a timeout.
How to solve that is unfortunately often vendor dependent. Switches can filter packets or fail to route them. I'd suggest running low-level RoCE tests first (OFED perftest) and making sure NCCL runs under the same conditions (GID index, traffic class, ...).
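For example, a minimal perftest check that pins the device and GID index might look like the following (the device name and IP are taken from the logs above; adjust both, and the GID index, to your setup):

# on machine-17 (server side)
ib_write_bw -d mlx5_bond_0 -x 3
# on machine-19 (client side), pointing at machine-17
ib_write_bw -d mlx5_bond_0 -x 3 10.11.170.41

If this already fails or hangs, the problem is in the fabric configuration rather than in NCCL.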
@NHZlX Hi, what is the output of the command show_gids in your env?
The NCCL_IB_GID_INDEX variable defines the Global ID index used in RoCE mode; see the InfiniBand show_gids command in order to set this value.
NCCL_IB_GID_INDEX=3 solved my issue.
My show_gids output:
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_2 1 0 fe80:0000:0000:0000:5054:00ff:fec2:7a7a v1 eth1
mlx5_2 1 1 fe80:0000:0000:0000:5054:00ff:fec2:7a7a v2 eth1
mlx5_2 1 2 0000:0000:0000:0000:0000:ffff:0bd8:591d 11.216.89.29 v1 eth1
mlx5_2 1 3 0000:0000:0000:0000:0000:ffff:0bd8:591d 11.216.89.29 v2 eth1
n_gids_found=4
Would you mind giving some explanation about NCCL_IB_GID_INDEX? Thanks.
The GID index basically determines how IB packets are encapsulated over Ethernet or IP (v4 or v6). I'm no expert, but I think here GID 0 and 1 would use Ethernet, GID 2 and 3 would use IPv4, and if you had an IPv6 address configured on the interface you would have two more GID indexes. For each, you can choose between RoCEv1 and RoCEv2, which have different encapsulations and capabilities.
So choosing the GID index is key to whether packets can be routed through the fabric and to how QoS policies are applied to them, but all of this obviously depends on how the fabric is configured, which is outside the reach of NCCL.
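In practice that means matching NCCL's settings to whatever the fabric expects. As an illustrative sketch (the traffic-class value here is purely an example, not a recommendation):

export NCCL_IB_GID_INDEX=3   # RoCEv2 over IPv4, per the show_gids output above
export NCCL_IB_TC=106        # optional: traffic class, if the fabric enforces one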
I set NCCL_IB_GID_INDEX=3 but still get this error.
@NHZlX How did you solve this problem?
There is no universal solution to this. Error 12 in IB terms is the same as "No route to host" with sockets.
It could be that you're not using the right interface (NCCL_IB_GID_INDEX), that your IP addressing is wrong, that the switch is down, or that the nodes are in different networks with no routing between them... it can be a lot of things. Basically it just says that two NICs could not talk to each other.
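A quick way to test raw RDMA connectivity outside of NCCL is rping from librdmacm-utils (the IP below is a placeholder for the server node's address):

# on one node, start the server
rping -s -v
# on the other node, connect to it
rping -c -a 10.11.170.41 -v

If rping fails too, the issue is in the RoCE/network setup rather than in NCCL.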
export NCCL_IB_GID_INDEX=3 solved my problem. Thanks very much.
It worked for me! Thanks!
I'm trying to use DeepSpeed for DDP, but this problem occurs. I added the line NCCL_IB_GID_INDEX=3 to .deepspeed_env and the problem was solved!
What magic! Can anyone explain a little bit here?
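For context, .deepspeed_env is just a plain file of NAME=value pairs that DeepSpeed exports to every launched worker, so a minimal version would be (assuming the default location in your home directory):

# ~/.deepspeed_env — one VAR=value per line, propagated to all workers
NCCL_IB_GID_INDEX=3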
With recent NCCL versions you should no longer need to set NCCL_IB_GID_INDEX=3, and doing so can actually work less well if the GID changes. So I would advise upgrading NCCL and removing that environment variable from your scripts in the future.