EleutherAI/gpt-neox

NCCL error in: ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3

mackmake opened this issue · 4 comments

Hi,
I started training on two nodes using the 125M.yml config file. I only changed the directories for the data and tokenizer files and added my own hostfile. During training I get this error:

NCCL error in: ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3

I ran the code again with NCCL_DEBUG=INFO and got this:

node2: Last error:
node2: Net : Call to recv from NODE2_IP<56843> failed : Connection refused                                                                                                         
node1: node1:15874:17284 [4] NCCL INFO P2P is disabled between connected GPUs 4 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.                              
node1: node1:15874:17284 [4] NCCL INFO Could not enable P2P between dev 4(=61000) and dev 3(=42000)                                                                               
node1: node1:15874:17284 [4] NCCL INFO Channel 00 : 4[61000] -> 3[42000] via SHM/direct/direct                                                                                    
node1: node1:15874:17284 [4] NCCL INFO P2P is disabled between connected GPUs 4 and 3. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.                              
node1: node1:15874:17284 [4] NCCL INFO Could not enable P2P between dev 4(=61000) and dev 3(=42000)                                                                               
node1: node1:15874:17284 [4] NCCL INFO Channel 01 : 4[61000] -> 3[42000] via SHM/direct/direct                                                                                    
node1: node1:15876:17282 [5] NCCL INFO Connected all trees                                                                                                                        
node1: node1:15876:17282 [5] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512                                                                                              
node1: node1:15876:17282 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer                                                                                   
node1: node1:15874:17284 [4] NCCL INFO Connected all trees                                                                                                                        
node1: node1:15874:17284 [4] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512                                                                                              
node1: node1:15874:17284 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer                                                                                   
node1: node1:15872:17277 [3] NCCL INFO Connected all trees                                                                                                                        
node1: node1:15872:17277 [3] NCCL INFO threadThresholds 8/8/64 | 96/8/64 | 512 | 512                                                                                              
node1: node1:15872:17277 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer                                                                                   
node2: node2:5496:6219 [0] NCCL INFO Channel 00/0 : 0[1000] -> 6[1000] [receive] via NET/Socket/0                                                                                 
node2: node2:5496:6219 [0] NCCL INFO Channel 01/0 : 0[1000] -> 6[1000] [receive] via NET/Socket/0                                                                                 
node2: node2:5496:6219 [0] NCCL INFO Channel 00/0 : 6[1000] -> 0[1000] [send] via NET/Socket/0                                                                                    
node2: node2:5496:6219 [0] NCCL INFO Channel 01/0 : 6[1000] -> 0[1000] [send] via NET/Socket/0     

What might be the problem?
How can I solve it?

What happens when you run with NCCL_IGNORE_DISABLED_P2P=1 set? Does it crash, or does it run less efficiently than one would desire?

If I remember correctly, it also crashed when I tested with NCCL_IGNORE_DISABLED_P2P. I have stopped using the multi-node approach.

NCCL_IGNORE_DISABLED_P2P=1 just disables the warning message. I think NCCL_P2P_DISABLE=1 is what you'd need?
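
For anyone trying this, here is a minimal sketch of setting the two variables mentioned in this thread (NCCL_DEBUG and NCCL_P2P_DISABLE) from Python before launching. The helper names `nccl_debug_env` and `to_env_file_lines` are made up for illustration and are not part of gpt-neox or NCCL. Whether disabling P2P fixes the "Connection refused" error above is not confirmed in this thread.

```python
import os


def nccl_debug_env(base=None, disable_p2p=True):
    """Return a copy of an environment with the NCCL variables from this thread set.

    `base` defaults to the current process environment. Hypothetical helper,
    not part of gpt-neox.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"  # verbose NCCL logging, as used in the report above
    if disable_p2p:
        # Disables the GPU peer-to-peer transport itself, unlike
        # NCCL_IGNORE_DISABLED_P2P, which only hides the warning.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def to_env_file_lines(env, keys=("NCCL_DEBUG", "NCCL_P2P_DISABLE")):
    """Render the selected variables as KEY=VALUE lines."""
    return [f"{key}={env[key]}" for key in keys if key in env]


if __name__ == "__main__":
    print("\n".join(to_env_file_lines(nccl_debug_env())))
```

On a single node you could pass the returned dict as `env=` to `subprocess.run` when starting the training script. On multiple nodes the variables have to reach every worker. As far as I know, the DeepSpeed launcher can pick up KEY=VALUE lines from a `.deepspeed_env` file, which is the format `to_env_file_lines` produces, but please check the DeepSpeed docs for your version.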

@mackmake -- Please reopen if this doesn't resolve your issue!