aws/aws-ofi-nccl

Training performance on p3dn.24xlarge on Amazon Linux 2 is worse than on Ubuntu 20.04 (with and without EFA)

yukunlin opened this issue · 2 comments

Overview of issue

I've observed that on p3dn.24xlarge instances, multi-node PyTorch training jobs using EFA and aws-ofi-nccl have worse performance on AL2 compared to an equivalent setup on Ubuntu 20.04.

AMI                              EFA Enabled    Throughput (wpm)
AL2 Deep Learning Base AMI       Yes            ~65000
AL2 Deep Learning Base AMI       No             ~45000
Ubuntu Deep Learning Base AMI    Yes            ~120000
Ubuntu Deep Learning Base AMI    No             ~110000

(The Ubuntu-with-EFA numbers were originally missing because of #107; see the update at the end of this issue.)

Repro steps

Common Setup

Both the AL2 and Ubuntu 20.04 runs use p3dn.24xlarge instances in the same VPC and the same placement group (cluster strategy). EFA is enabled on the network interfaces, and I've verified that the EFA drivers are installed.
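
For reference, EFA availability on each node can be sanity-checked with something like the following (a minimal sketch assuming the default EFA installer layout; not necessarily the exact commands from the original runs):

# Check that the efa libfabric provider and the uverbs device are visible on the host
/opt/amazon/efa/bin/fi_info -p efa      # should list the efa provider (output shown below)
ibv_devices                             # should list the device backing /dev/infiniband/uverbs0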

Dockerfile used for training job:

Training data was downloaded and pre-processed following https://github.com/pytorch/fairseq/blob/main/examples/language_model/README.md
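
The preprocessing roughly amounts to the steps below (a sketch based on that README; script names and flags may differ from the current version):

# From the fairseq repo root: download/tokenize WikiText-103, then binarize it
cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..
fairseq-preprocess \
    --only-source \
    --trainpref examples/language_model/wikitext-103/wiki.train.tokens \
    --validpref examples/language_model/wikitext-103/wiki.valid.tokens \
    --testpref examples/language_model/wikitext-103/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20                        # binarized output lands in data-bin/wikitext-103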

A two-node setup was used for the training job.

AL2 Setup

  • AMI: Deep Learning Base AMI (Amazon Linux 2) Version 52.0, (ami-07f6f7cc742921659 in us-west-2)

    • Nvidia driver version: 510.47.03

    • CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)

    • /opt/amazon/efa/bin/fi_info -p efa:

      provider: efa
          fabric: EFA-fe80::58:92ff:fe3f:a7a3
          domain: rdmap0s6-rdm
          version: 114.0
          type: FI_EP_RDM
          protocol: FI_PROTO_EFA
      provider: efa
          fabric: EFA-fe80::58:92ff:fe3f:a7a3
          domain: rdmap0s6-dgrm
          version: 114.0
          type: FI_EP_DGRAM
          protocol: FI_PROTO_EFA

    • nvidia-docker version: 20.10.7

Training command (EFA enabled) (executed on both training nodes):

nvidia-docker run \
   --mount type=bind,src=/mnt/fsx,dst=/job \
   --network host \
   --device /dev/infiniband/uverbs0 \
   --env NCCL_SOCKET_IFNAME=eth0 \
   --env FI_PROVIDER=efa \
   --env LOGLEVEL=INFO \
   --env NCCL_PROTO=simple \
   --env NCCL_DEBUG=INFO \
   919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
   python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
   fairseq_train_wrapped \
   --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_al2_efa.txt

Note that we're using the --device /dev/infiniband/uverbs0 flag to pass through EFA. For the non-EFA run, we drop the arguments --device /dev/infiniband/uverbs0 --env FI_PROVIDER=efa.
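
As an optional sanity check (not part of the training runs), the provider can also be queried from inside the container, assuming fi_info is installed in the image:

nvidia-docker run \
   --device /dev/infiniband/uverbs0 \
   919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
   fi_info -p efa                       # should list the efa provider, same as on the host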

Ubuntu Setup

  • AMI: AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) 20220403, (ami-061dac75dbd529aef in us-west-2)
    • Nvidia driver version: 510.47.03

    • CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)

    • /opt/amazon/efa/bin/fi_info -p efa:

      provider: efa
          fabric: EFA-fe80::da:b9ff:fe04:8af
          domain: rdmap0s6-rdm
          version: 114.10
          type: FI_EP_RDM
          protocol: FI_PROTO_EFA
      provider: efa
          fabric: EFA-fe80::da:b9ff:fe04:8af
          domain: rdmap0s6-dgrm
          version: 114.10
          type: FI_EP_DGRAM
          protocol: FI_PROTO_EFA
      
    • nvidia-docker version: 20.10.14

Training command (EFA enabled) (executed on both training nodes):

nvidia-docker run \
   --mount type=bind,src=/home/ubuntu/ps-fsx,dst=/job \
   --network host \
   --device /dev/infiniband/uverbs0 \
   --env FI_PROVIDER=EFA \
   --env NCCL_SOCKET_IFNAME=ens5 \
   --env LOGLEVEL=INFO \
   --env NCCL_PROTO=simple \
   --env NCCL_DEBUG=INFO \
   --ulimit memlock=-1 \
   919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
   python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
   --master_addr=$MASTER_IP --master_port=12345 \
   fairseq_train_wrapped \
   --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_ubuntu_efa.txt

Compared to the AL2 run, we add --ulimit memlock=-1 due to #107. Note that adding the same flag to the equivalent AL2 run makes no difference in performance.
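
To confirm the limit actually takes effect inside the container, a quick one-off check (assuming bash is available in the image):

nvidia-docker run \
   --ulimit memlock=-1 \
   919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
   bash -c 'ulimit -l'                  # should print "unlimited"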

For the non-EFA run, we drop the arguments --device /dev/infiniband/uverbs0 --env FI_PROVIDER=efa.

Results

AMI                              EFA Enabled    Throughput (wpm)
AL2 Deep Learning Base AMI       Yes            ~65000
AL2 Deep Learning Base AMI       No             ~45000
Ubuntu Deep Learning Base AMI    Yes            ~120000
Ubuntu Deep Learning Base AMI    No             ~110000

Run Logs

AL2 (with EFA)

Interesting bits

[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Selected Provider is efa
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Using network AWS Libfabric
[0]:NCCL version 2.10.3+cuda11.3

Full log: https://gist.github.com/yukunlin/634c600a11e36d1384215ab08366e774

AL2 (no EFA)

Initialization:

[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/IB : No device found.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Using network Socket
[0]:NCCL version 2.10.3+cuda11.3

Full log: https://gist.github.com/yukunlin/8c4298450299b33dd9a4c0559f50eccc

Ubuntu (EFA)

Initialization:

ip-10-0-0-115:74:74 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:74:74 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.3

Full logs: https://gist.github.com/yukunlin/ba8e41131abc1a7e4fb288b480d94b8f

Ubuntu (no EFA)

Initialization:

ip-10-0-0-115:73:73 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1

ip-10-0-0-115:73:73 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported

ip-10-0-0-115:73:73 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:73:73 [0] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3

Full log: https://gist.github.com/yukunlin/95a1036dba1c3a677f8f130e6cf23fbf

As #107 is now resolved, I've updated the issue with EFA results for Ubuntu. The good news is that EFA does increase performance in both the AL2 and Ubuntu benchmark runs.

If the maintainers feel that this is more of an AL2 issue than an aws-ofi-nccl issue, let me know where I can redirect this issue.

This is quite silly; I root-caused it to the launch template setup.

The AL2 launch template was initially created using p3.2xlarge instances via the AWS web console, which also led to the CPU options being set to 4 cores and 2 threads per core.

[Screenshot: launch template CPU options override showing 4 cores, 2 threads per core]

When I created a new version of the AL2 launch template via the web console to use p3dn.24xlarge instances, the CPU options weren't updated automatically and stayed stuck at 4 cores and 2 threads per core (despite p3dn.24xlarge having 48 cores). Removing the CPU options override from the launch template fixed the slow performance.
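
For anyone hitting the same thing, a sketch of how to check a launch template for a stale CpuOptions override (the launch-template ID below is a placeholder):

# Inspect the latest launch template version; an unexpected CpuOptions block is the smoking gun
aws ec2 describe-launch-template-versions \
    --launch-template-id lt-0123456789abcdef0 \
    --versions '$Latest' \
    --query 'LaunchTemplateVersions[].LaunchTemplateData.CpuOptions'

# On a running instance, the full vCPU count should also be visible (96 for p3dn.24xlarge)
nproc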

In contrast, the Ubuntu launch template was created initially with p3dn.24xlarge, so it didn't have the misconfiguration of CPU options.

To summarize, the observed performance drop had nothing to do with:

  • AMI and EFA
  • Docker image (there was no need to build from scratch; the AWS-vended DL container images performed the same as the image built from scratch)

The root cause was a misconfigured launch template, caused by the web console's UX (it's completely unintuitive for it to keep the CPU options locked even after the instance type is changed).