Training performance on p3dn.24xlarge on Amazon Linux 2 is worse than on Ubuntu 20.04 (with and without EFA)
Overview of issue
I've observed that on p3dn.24xlarge instances, multi-node PyTorch training jobs using EFA and aws-ofi-nccl have worse performance on AL2 than on an equivalent Ubuntu 20.04 setup.
| AMI | EFA Enabled | Throughput (wpm) |
|---|---|---|
| AL2 Deep Learning Base AMI | Yes | ~65000 |
| AL2 Deep Learning Base AMI | No | ~45000 |
| Ubuntu Deep Learning Base AMI | Yes | ~120000 |
| Ubuntu Deep Learning Base AMI | No | ~110000 |
I originally didn't have numbers for Ubuntu with EFA because of #107; those were added after it was resolved (see the update below).
Repro steps
Common Setup
On both AL2 and Ubuntu 20.04, we're using p3dn.24xlarge instances in the same VPC and the same placement group (cluster strategy). EFA is enabled on the network interfaces, and I've verified that the EFA drivers are installed.
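For reference, a quick sanity check of the EFA installation on each node (assuming the standard EFA installer layout under /opt/amazon/efa):

```bash
# Confirm the EFA kernel module is loaded
lsmod | grep efa

# Confirm libfabric can see the EFA provider
/opt/amazon/efa/bin/fi_info -p efa
```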
- Dockerfile used for the training job: https://gist.github.com/yukunlin/382aa9aefc88e2cf093278a1fd42a1ce
- `fairseq_train_wrapped` (built into the Dockerfile; a workaround from facebookresearch/fairseq#4302 that forwards the `LOCAL_RANK` environment variable as the `--local_rank` argument):

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import sys

from fairseq_cli.train import cli_main

if __name__ == "__main__":
    sys.argv += ["--local_rank", os.getenv("LOCAL_RANK")]
    sys.exit(cli_main())
```
- CUDA version: 11.3
- Training data download and pre-processing following https://github.com/pytorch/fairseq/blob/main/examples/language_model/README.md (see the sketch after this list)
- A two-node setup was used for the training job.
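For completeness, the wikitext-103 preparation roughly follows the commands from that README (a sketch, assuming a fairseq checkout; not necessarily the exact invocation used here):

```bash
# Download and tokenize wikitext-103 (from the fairseq language_model example)
cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..

# Binarize the dataset for fairseq
fairseq-preprocess \
    --only-source \
    --trainpref examples/language_model/wikitext-103/wiki.train.tokens \
    --validpref examples/language_model/wikitext-103/wiki.valid.tokens \
    --testpref examples/language_model/wikitext-103/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20
```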
AL2 Setup
- AMI: Deep Learning Base AMI (Amazon Linux 2) Version 52.0 (ami-07f6f7cc742921659 in us-west-2)
- Nvidia driver version: 510.47.03
- CUDA version: 11.6 (shouldn't be a factor because we're running the job from Docker)
- Output of `/opt/amazon/efa/bin/fi_info -p efa`:

```
provider: efa
    fabric: EFA-fe80::58:92ff:fe3f:a7a3
    domain: rdmap0s6-rdm
    version: 114.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::58:92ff:fe3f:a7a3
    domain: rdmap0s6-dgrm
    version: 114.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
```

- nvidia-docker version: 20.10.7
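For reference, one way to check the driver and CUDA toolkit versions on the host (standard NVIDIA tooling; paths assume a default CUDA install):

```bash
# Driver version as reported by the NVIDIA driver
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA toolkit version installed on the host
/usr/local/cuda/bin/nvcc --version
```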
Training command (EFA enabled), executed on both training nodes:

```bash
nvidia-docker run \
  --mount type=bind,src=/mnt/fsx,dst=/job \
  --network host \
  --device /dev/infiniband/uverbs0 \
  --env NCCL_SOCKET_IFNAME=eth0 \
  --env FI_PROVIDER=efa \
  --env LOGLEVEL=INFO \
  --env NCCL_PROTO=simple \
  --env NCCL_DEBUG=INFO \
  919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
  python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
  fairseq_train_wrapped \
  --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_al2_efa.txt
```
Note that we're using the `--device /dev/infiniband/uverbs0` flag to pass the EFA device through to the container. For the non-EFA run, we drop the arguments `--device /dev/infiniband/uverbs0 --env FI_PROVIDER=efa`.
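As a quick check that the passthrough works, fi_info can be run from inside the same image (a sketch; this assumes the image bundles libfabric under the usual /opt/amazon/efa path):

```bash
# The efa provider should be listed when the device passthrough works
nvidia-docker run --rm \
  --device /dev/infiniband/uverbs0 \
  919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
  /opt/amazon/efa/bin/fi_info -p efa
```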
Ubuntu Setup
- AMI: AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) 20220403 (ami-061dac75dbd529aef in us-west-2)
- Nvidia driver version: 510.47.03
- CUDA version: 11.6 (shouldn't be a factor because we're running the job from Docker)
- Output of `/opt/amazon/efa/bin/fi_info -p efa`:

```
provider: efa
    fabric: EFA-fe80::da:b9ff:fe04:8af
    domain: rdmap0s6-rdm
    version: 114.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::da:b9ff:fe04:8af
    domain: rdmap0s6-dgrm
    version: 114.10
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
```

- nvidia-docker version: 20.10.14
-
Training command (EFA enabled), executed on both training nodes:

```bash
nvidia-docker run \
  --mount type=bind,src=/home/ubuntu/ps-fsx,dst=/job \
  --network host \
  --device /dev/infiniband/uverbs0 \
  --env FI_PROVIDER=EFA \
  --env NCCL_SOCKET_IFNAME=ens5 \
  --env LOGLEVEL=INFO \
  --env NCCL_PROTO=simple \
  --env NCCL_DEBUG=INFO \
  --ulimit memlock=-1 \
  919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
  python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
  --master_addr=$MASTER_IP --master_port=12345 \
  fairseq_train_wrapped \
  --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_ubuntu_efa.txt
```
Compared to the AL2 run, we add `--ulimit memlock=-1` due to #107. Note that adding the same flag to the equivalent AL2 run makes no difference in performance.
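To confirm the flag is actually honored inside the container, the locked-memory limit can be checked directly (a minimal sketch; assumes bash is available in the image):

```bash
# Should print "unlimited" when --ulimit memlock=-1 takes effect
nvidia-docker run --rm --ulimit memlock=-1 \
  919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
  bash -c 'ulimit -l'
```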
For the non-EFA run, we drop the arguments `--device /dev/infiniband/uverbs0 --env FI_PROVIDER=efa`.
Results
| AMI | EFA Enabled | Throughput (wpm) |
|---|---|---|
| AL2 Deep Learning Base AMI | Yes | ~65000 |
| AL2 Deep Learning Base AMI | No | ~45000 |
| Ubuntu Deep Learning Base AMI | Yes | ~120000 |
| Ubuntu Deep Learning Base AMI | No | ~110000 |
Run Logs
AL2 (with EFA)
Interesting bits:

```
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Selected Provider is efa
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Using network AWS Libfabric
[0]:NCCL version 2.10.3+cuda11.3
```
Full log: https://gist.github.com/yukunlin/634c600a11e36d1384215ab08366e774
AL2 (no EFA)
Initialization:
```
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/IB : No device found.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Using network Socket
[0]:NCCL version 2.10.3+cuda11.3
```
Full log: https://gist.github.com/yukunlin/8c4298450299b33dd9a4c0559f50eccc
Ubuntu (EFA)
Initialization:
```
ip-10-0-0-115:74:74 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:74:74 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.3
```
Full logs: https://gist.github.com/yukunlin/ba8e41131abc1a7e4fb288b480d94b8f
Ubuntu (no EFA)
Initialization:
```
ip-10-0-0-115:73:73 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:73:73 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:73:73 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:73:73 [0] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
```
Full log: https://gist.github.com/yukunlin/95a1036dba1c3a677f8f130e6cf23fbf
As #107 is now resolved, I've updated the issue with EFA results for Ubuntu. The good news is that EFA does increase performance on both the AL2 and Ubuntu benchmark runs.
If the maintainers feel this is more of an AL2 issue than an aws-ofi-nccl issue, let me know where I should redirect it.
This is quite silly; I root-caused it to the launch template setup.
The AL2 launch template was initially created for p3.2xlarge instances via the AWS web console. This also led to the CPU options being set to 4 cores and 2 threads per core.
When I created a new version of the AL2 launch template via the web console to use p3dn.24xlarge instances, the CPU options weren't updated automatically and stayed stuck at 4 cores and 2 threads per core (despite p3dn.24xlarge having 48 cores). Removing the CPU options override from the launch template fixed the slow performance.
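For anyone hitting the same thing, the CPU options actually applied to an instance are easy to inspect with the AWS CLI (a sketch; the instance ID is a placeholder):

```bash
# A correctly configured p3dn.24xlarge should report
# CoreCount: 48, ThreadsPerCore: 2.
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].CpuOptions'
```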
In contrast, the Ubuntu launch template was created with p3dn.24xlarge from the start, so it didn't have the CPU options misconfiguration.
To summarize, the observed performance drop had nothing to do with:
- the AMI or EFA
- the Docker image (there was no need to build one from scratch; the AWS-vended DL container images performed the same as the image built from scratch)

The root cause was a misconfigured launch template, stemming from the web console's UX (it's completely unintuitive for it to keep the CPU options locked even after the instance type is changed).