Training performance on p3dn.24xlarge on Amazon Linux 2 is worse than on Ubuntu 20.04 (with and without EFA)
Overview of issue
I've observed that on p3dn.24xlarge instances, multi-node PyTorch training jobs using EFA and aws-ofi-nccl have worse performance on AL2 than on an equivalent Ubuntu 20.04 setup.
| AMI | EFA Enabled | Throughput (wpm) |
|---|---|---|
| AL2 Deep Learning Base AMI | Yes | ~65000 |
| AL2 Deep Learning Base AMI | No | ~45000 |
| Ubuntu Deep Learning Base AMI | Yes | ~120000 |
| Ubuntu Deep Learning Base AMI | No | ~110000 |
I originally didn't have numbers for Ubuntu with EFA because of #107; those were added after it was resolved (see the update below).
Repro steps
Common Setup
On both AL2 and Ubuntu 20.04, we're using p3dn.24xlarge instances in the same VPC and the same placement group (cluster strategy). EFA is enabled on the network interfaces, and I've verified that the EFA drivers are installed.
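For reference, a quick sanity check of the EFA installation on each node (assuming the standard EFA installer layout under /opt/amazon/efa):

```bash
# Confirm the EFA kernel module is loaded
lsmod | grep efa

# Confirm libfabric can see the EFA provider
/opt/amazon/efa/bin/fi_info -p efa
```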
- Dockerfile used for the training job: https://gist.github.com/yukunlin/382aa9aefc88e2cf093278a1fd42a1ce
- `fairseq_train_wrapped` (built into the Dockerfile; a workaround from facebookresearch/fairseq#4302 that forwards the `LOCAL_RANK` environment variable as the `--local_rank` argument):

```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import sys

from fairseq_cli.train import cli_main

if __name__ == "__main__":
    sys.argv += ["--local_rank", os.getenv("LOCAL_RANK")]
    sys.exit(cli_main())
```
- CUDA version: 11.3
- Training data download and pre-processing following https://github.com/pytorch/fairseq/blob/main/examples/language_model/README.md (see the sketch after this list)
- A two-node setup was used for the training job.
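For completeness, the wikitext-103 preparation roughly follows the commands from that README (a sketch, assuming a fairseq checkout; not necessarily the exact invocation used here):

```bash
# Download and tokenize wikitext-103 (from the fairseq language_model example)
cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..

# Binarize the dataset for fairseq
fairseq-preprocess \
    --only-source \
    --trainpref examples/language_model/wikitext-103/wiki.train.tokens \
    --validpref examples/language_model/wikitext-103/wiki.valid.tokens \
    --testpref examples/language_model/wikitext-103/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20
```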
AL2 Setup
- AMI: Deep Learning Base AMI (Amazon Linux 2) Version 52.0 (ami-07f6f7cc742921659 in us-west-2)
- Nvidia driver version: 510.47.03
- CUDA version: 11.6 (shouldn't be a factor because we're running the job from Docker)
- Output of `/opt/amazon/efa/bin/fi_info -p efa`:

```
provider: efa
    fabric: EFA-fe80::58:92ff:fe3f:a7a3
    domain: rdmap0s6-rdm
    version: 114.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::58:92ff:fe3f:a7a3
    domain: rdmap0s6-dgrm
    version: 114.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
```

- nvidia-docker version: 20.10.7
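For reference, one way to check the driver and CUDA toolkit versions on the host (standard NVIDIA tooling; paths assume a default CUDA install):

```bash
# Driver version as reported by the NVIDIA driver
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA toolkit version installed on the host
/usr/local/cuda/bin/nvcc --version
```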
Training command (EFA enabled), executed on both training nodes:

```bash
nvidia-docker run \
  --mount type=bind,src=/mnt/fsx,dst=/job \
  --network host \
  --device /dev/infiniband/uverbs0 \
  --env NCCL_SOCKET_IFNAME=eth0 \
  --env FI_PROVIDER=efa \
  --env LOGLEVEL=INFO \
  --env NCCL_PROTO=simple \
  --env NCCL_DEBUG=INFO \
  919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
  python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
  fairseq_train_wrapped \
  --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_al2_efa.txt
```
Note that we're using the `--device /dev/infiniband/uverbs0` flag to pass the EFA device through to the container. For the non-EFA run, we drop the arguments `--device /dev/infiniband/uverbs0 --env FI_PROVIDER=efa`.
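As a quick check that the passthrough works, fi_info can be run from inside the same image (a sketch; this assumes the image bundles libfabric under the usual /opt/amazon/efa path):

```bash
# The efa provider should be listed when the device passthrough works
nvidia-docker run --rm \
  --device /dev/infiniband/uverbs0 \
  919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
  /opt/amazon/efa/bin/fi_info -p efa
```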
Ubuntu Setup
- AMI: AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) 20220403 (ami-061dac75dbd529aef in us-west-2)
- Nvidia driver version: 510.47.03
- CUDA version: 11.6 (shouldn't be a factor because we're running the job from Docker)
- Output of `/opt/amazon/efa/bin/fi_info -p efa`:

```
provider: efa
    fabric: EFA-fe80::da:b9ff:fe04:8af
    domain: rdmap0s6-rdm
    version: 114.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::da:b9ff:fe04:8af
    domain: rdmap0s6-dgrm
    version: 114.10
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
```

- nvidia-docker version: 20.10.14
-
Training command (EFA enabled), executed on both training nodes:

```bash
nvidia-docker run \
  --mount type=bind,src=/home/ubuntu/ps-fsx,dst=/job \
  --network host \
  --device /dev/infiniband/uverbs0 \
  --env FI_PROVIDER=EFA \
  --env NCCL_SOCKET_IFNAME=ens5 \
  --env LOGLEVEL=INFO \
  --env NCCL_PROTO=simple \
  --env NCCL_DEBUG=INFO \
  --ulimit memlock=-1 \
  919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
  python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
  --master_addr=$MASTER_IP --master_port=12345 \
  fairseq_train_wrapped \
  --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_ubuntu_efa.txt
```
Compared to the AL2 run, we add `--ulimit memlock=-1` due to #107. Note that adding the same flag to the equivalent AL2 run makes no difference in performance.
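To confirm the flag is actually honored inside the container, the locked-memory limit can be checked directly (a minimal sketch; assumes bash is available in the image):

```bash
# Should print "unlimited" when --ulimit memlock=-1 takes effect
nvidia-docker run --rm --ulimit memlock=-1 \
  919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
  bash -c 'ulimit -l'
```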
For the non-EFA run, we drop the arguments `--device /dev/infiniband/uverbs0 --env FI_PROVIDER=efa`.
Results
| AMI | EFA Enabled | Throughput (wpm) |
|---|---|---|
| AL2 Deep Learning Base AMI | Yes | ~65000 |
| AL2 Deep Learning Base AMI | No | ~45000 |
| Ubuntu Deep Learning Base AMI | Yes | ~120000 |
| Ubuntu Deep Learning Base AMI | No | ~110000 |
Run Logs
AL2 (with EFA)
Interesting bits:

```
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Selected Provider is efa
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Using network AWS Libfabric
[0]:NCCL version 2.10.3+cuda11.3
```
Full log: https://gist.github.com/yukunlin/634c600a11e36d1384215ab08366e774
AL2 (no EFA)
Initialization:
```
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/IB : No device found.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Using network Socket
[0]:NCCL version 2.10.3+cuda11.3
```
Full log: https://gist.github.com/yukunlin/8c4298450299b33dd9a4c0559f50eccc
Ubuntu (EFA)
Initialization:
```
ip-10-0-0-115:74:74 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:74:74 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.3
```
Full logs: https://gist.github.com/yukunlin/ba8e41131abc1a7e4fb288b480d94b8f
Ubuntu (no EFA)
Initialization:
```
ip-10-0-0-115:73:73 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:73:73 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:73:73 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:73:73 [0] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
```
Full log: https://gist.github.com/yukunlin/95a1036dba1c3a677f8f130e6cf23fbf
As #107 is now resolved, I've updated the issue with EFA results for Ubuntu. The good news is that EFA does increase performance on both the AL2 and Ubuntu benchmark runs.
If the maintainers feel this is more of an AL2 issue than an aws-ofi-nccl issue, let me know where I should redirect it.
This is quite silly; I root-caused it to the launch template setup.
The AL2 launch template was initially created for p3.2xlarge instances via the AWS web console. This also led to the CPU options being set to 4 cores and 2 threads per core.
When I created a new version of the AL2 launch template via the web console to use p3dn.24xlarge instances, the CPU options weren't updated automatically and stayed stuck at 4 cores and 2 threads per core (despite p3dn.24xlarge having 48 cores). Removing the CPU options override from the launch template fixed the slow performance.
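For anyone hitting the same thing, the CPU options actually applied to an instance are easy to inspect with the AWS CLI (a sketch; the instance ID is a placeholder):

```bash
# A correctly configured p3dn.24xlarge should report
# CoreCount: 48, ThreadsPerCore: 2.
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].CpuOptions'
```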
In contrast, the Ubuntu launch template was created with p3dn.24xlarge from the start, so it didn't have the CPU options misconfiguration.
To summarize, the observed performance drop had nothing to do with:
- the AMI or EFA
- the Docker image (there was no need to build one from scratch; the AWS-vended DL container images performed the same as the image built from scratch)

The root cause was a misconfigured launch template, stemming from the web console's UX (it's completely unintuitive for it to keep the CPU options locked even after the instance type is changed).