aws/aws-ofi-nccl

NCCL WARN NET/OFI Request completed with error. RC: 21. Error: unknown error

Ridhamz-nd opened this issue · 2 comments

Hello aws_ofi_nccl maintainers,

Please let me know if this is not the best place to post this issue, and I will close it.

I am unable to figure out why the process is hanging after the error message is shown.

My training setup:
Two ml.g4dn.12xlarge instances on AWS SageMaker running distributed training with the PyTorch base image 763104351884.dkr.ecr.us-west-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker.
The two instances run inside a private subnet with a NAT gateway attached to the subnet.

All outputs below are from host-1.
Output of lspci:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:06.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:07.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
00:08.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:09.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:1a.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:1b.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1c.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1d.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller

Output of cat /opt/amazon/efa_installed_packages:

# EFA installer version: 1.15.1
# Debug packages installed: no
# Packages installed:
efa-config_1.9_all efa-profile_1.5_all libfabric-aws-bin_1.14.0amzn1.0_amd64 libfabric-aws-dev_1.14.0amzn1.0_amd64 libfabric1-aws_1.14.0amzn1.0_amd64 openmpi40-aws_4.1.2-1_amd64 ibacm_39.0-1_amd64 ibverbs-providers_39.0-1_amd64 ibverbs-utils_39.0-1_amd64 infiniband-diags_39.0-1_amd64 libibmad-dev_39.0-1_amd64 libibmad5_39.0-1_amd64 libibnetdisc-dev_39.0-1_amd64 libibnetdisc5_39.0-1_amd64 libibumad-dev_39.0-1_amd64 libibumad3_39.0-1_amd64 libibverbs-dev_39.0-1_amd64 libibverbs1_39.0-1_amd64 librdmacm-dev_39.0-1_amd64 librdmacm1_39.0-1_amd64 rdma-core_39.0-1_amd64 rdmacm-utils_39.0-1_amd64

Output of /opt/amazon/efa/bin/fi_info -p efa:

provider: efa
    fabric: EFA-fe80::424:a9ff:fed5:b935
    domain: efa_0-rdm
    version: 114.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::424:a9ff:fed5:b935
    domain: efa_0-dgrm
    version: 114.10
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA

Output of training job:
Distributed training is initialized with the NCCL backend in PyTorch using the mmaction2 training library. I set FI_EFA_USE_DEVICE_RDMA=0 because T4 GPUs do not support GPUDirect RDMA. The command is run via os.system() from the entrypoint passed to SageMaker; see the sketch after the command below.
cmd=

NCCL_SOCKET_IFNAME=eth0 FI_PROVIDER="efa" FI_EFA_USE_DEVICE_RDMA=0 NCCL_DEBUG=INFO FI_LOG_LEVEL=warn FI_LOG_PROV=efa PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.22b20221214-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg:/opt/conda/lib/python3.8/site-packages/flash_attn-0.1-py3.8-linux-x86_64.egg:/opt/conda/lib/python3.8/site-packages/einops-0.6.0-py3.8.egg python -m torch.distributed.launch --nnodes=2 --node_rank=0  --master_addr=algo-1  --nproc_per_node=4  --master_port=7777  <train script> <config.py>
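
For context, a minimal sketch of how the entrypoint launches this command with os.system(). The env dict below is trimmed to the NCCL/libfabric variables, and train.py / config.py stand in for the elided <train script> and <config.py>; this is illustrative, not the actual entrypoint code:

import os

# Hypothetical reconstruction of the SageMaker entrypoint launch; the real
# entrypoint builds the full command string shown above.
env = {
    "NCCL_SOCKET_IFNAME": "eth0",
    "FI_PROVIDER": "efa",
    "FI_EFA_USE_DEVICE_RDMA": "0",  # T4 GPUs lack GPUDirect RDMA support
    "NCCL_DEBUG": "INFO",
    "FI_LOG_LEVEL": "warn",
    "FI_LOG_PROV": "efa",
}
env_prefix = " ".join(f"{k}={v}" for k, v in env.items())
launch = (
    "python -m torch.distributed.launch --nnodes=2 --node_rank=0 "
    "--master_addr=algo-1 --nproc_per_node=4 --master_port=7777 "
    "train.py config.py"  # placeholders for the elided script and config
)
os.system(f"{env_prefix} {launch}")

NCCL debug output from the job: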
algo-1:462:462 [0] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:462:462 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:462:462 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:462:462 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
libfabric:462:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:462:462 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:462:462 [0] NCCL INFO NET/OFI Selected Provider is efa
algo-1:462:462 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.3
algo-1:463:463 [1] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:464:464 [2] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:465:465 [3] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:465:465 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:465:465 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:464:464 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:463:463 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:464:464 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:463:463 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:465:465 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:464:464 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:463:463 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
libfabric:465:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:465:465 [3] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:465:465 [3] NCCL INFO NET/OFI Selected Provider is efa
algo-1:465:465 [3] NCCL INFO Using network AWS Libfabric
libfabric:463:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
libfabric:464:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:463:463 [1] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:463:463 [1] NCCL INFO NET/OFI Selected Provider is efa
algo-1:463:463 [1] NCCL INFO Using network AWS Libfabric
algo-1:464:464 [2] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:464:464 [2] NCCL INFO NET/OFI Selected Provider is efa
algo-1:464:464 [2] NCCL INFO Using network AWS Libfabric
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:463:557 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
algo-1:464:558 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
algo-1:465:556 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
algo-1:462:555 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
algo-1:462:555 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
algo-1:462:555 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->4
algo-1:462:555 [0] NCCL INFO Channel 00 : 7[1e0] -> 0[1b0] [receive] via NET/AWS Libfabric/0
algo-1:464:558 [2] NCCL INFO Channel 00 : 2[1d0] -> 3[1e0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 00 : 1[1c0] -> 2[1d0] via direct shared memory
algo-1:464:558 [2] NCCL INFO Channel 01 : 2[1d0] -> 3[1e0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 01 : 1[1c0] -> 2[1d0] via direct shared memory
algo-1:465:556 [3] NCCL INFO Channel 00 : 3[1e0] -> 4[1b0] [send] via NET/AWS Libfabric/0
algo-1:465:556 [3] NCCL INFO Channel 01 : 3[1e0] -> 4[1b0] [send] via NET/AWS Libfabric/0
algo-1:464:558 [2] NCCL INFO Connected all rings
algo-1:462:555 [0] NCCL INFO Channel 01 : 7[1e0] -> 0[1b0] [receive] via NET/AWS Libfabric/0
algo-1:462:555 [0] NCCL INFO Channel 00 : 0[1b0] -> 1[1c0] via direct shared memory
algo-1:462:555 [0] NCCL INFO Channel 01 : 0[1b0] -> 1[1c0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Connected all rings
algo-1:464:558 [2] NCCL INFO Channel 00 : 2[1d0] -> 1[1c0] via direct shared memory
algo-1:464:558 [2] NCCL INFO Channel 01 : 2[1d0] -> 1[1c0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 00 : 1[1c0] -> 0[1b0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 01 : 1[1c0] -> 0[1b0] via direct shared memory
libfabric:465:1673041637:efa:cq:rxr_cq_write_tx_error():243<warn> rxr_cq_write_tx_error: err: 21, prov_err: Unknown error -21 (21)
algo-1:465:556 [3] ofi_process_cq:1033 NCCL WARN NET/OFI Request 0x7f6390394d18 completed with error. RC: 21. Error: unknown error. Completed length: 0, Request: { buffer_index: 255, dev: 0, size: 0, state: CREATED, direction: SEND }

I see the same error on the algo-2 instance as well.

PyTorch version and environment info reported by mmaction2:

2023-01-06 21:47:12,614 - mmaction - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
CUDA available: True
GPU 0,1,2,3: Tesla T4
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.1+cu113
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.1+cu113
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMAction2: 0.24.1+
------------------------------------------------------------
2023-01-06 21:47:12,614 - mmaction - INFO - Distributed training: True

I noticed that you are using g4dn instances. Unfortunately, the plugin doesn't support the g4dn platform; we only support the p3dn and p4d platforms.

Looking at the libfabric error,

libfabric:465:1673041637:efa:cq:rxr_cq_write_tx_error():243<warn> rxr_cq_write_tx_error: err: 21, prov_err: Unknown error -21 (21)

@wzamazon Can you help reason out the error code?

@rashikakheria Thanks for the response.

After reaching out to AWS support, I found out that the security group must have an ingress rule and an egress rule that allow all traffic when the source/destination is the security group itself (a self-referencing rule). This rule is required even if a rule such as

| IP version | Type | Protocol | Port range | Destination |
| --- | --- | --- | --- | --- |
| IPv4 | All traffic | All | All | 0.0.0.0/0 |

is present.
After adding the self-referencing rules, the instances are able to communicate with each other over EFA.
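
For anyone hitting the same issue, a minimal boto3 sketch of the fix, where sg_id is a hypothetical placeholder for the security group attached to both instances:

import boto3

ec2 = boto3.client("ec2")
sg_id = "sg-0123456789abcdef0"  # hypothetical: the SG attached to both instances

# Self-referencing rules: allow all traffic between members of the same
# security group. These are required in addition to any 0.0.0.0/0 rules.
self_rule = [{"IpProtocol": "-1", "UserIdGroupPairs": [{"GroupId": sg_id}]}]
ec2.authorize_security_group_ingress(GroupId=sg_id, IpPermissions=self_rule)
ec2.authorize_security_group_egress(GroupId=sg_id, IpPermissions=self_rule)

The same rules can be added from the EC2 console by selecting the security group itself as the source (inbound) and destination (outbound).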