Tools to benchmark AWS network performance, focused on workloads encountered in neural network training.
The goal of these benchmarks is to track and identify bottlenecks that prevent efficient training of large neural networks, such as data-parallel training of Megatron, which is a 300M parameter BERT model.
(tested on a fresh instance with DLAMI 23)
conda create -y -n main python=3.6
source activate main
git clone https://github.com/cybertronai/aws-network-benchmarks
cd aws-network-benchmarks
pip install -r requirements.txt
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-east-1
(optional, to save logs+graphs) export WANDB_API_KEY=<your key from https://app.wandb.ai/settings>
export NCLUSTER_ZONE=<some zone that contains p3dn instances>
Note: you can use "ncluster spot_prices p3dn" to see valid p3dn zones
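As a rough pre-flight check, the sketch below verifies that the variables exported above are set before anything is launched. The helper names missing_vars and zone_matches_region are hypothetical and not part of this repository, and the zone check assumes a standard availability zone name (region name plus one letter), which may not hold for every zone type.

```python
import os

# Variables exported in the setup steps; WANDB_API_KEY is optional there.
REQUIRED = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY",
            "AWS_DEFAULT_REGION", "NCLUSTER_ZONE"]
OPTIONAL = ["WANDB_API_KEY"]


def missing_vars(env, names):
    """Return the names that are unset or empty in env."""
    return [n for n in names if not env.get(n)]


def zone_matches_region(env):
    """Assumption: a zone is the region name plus one letter, e.g. us-east-1b."""
    region = env.get("AWS_DEFAULT_REGION", "")
    zone = env.get("NCLUSTER_ZONE", "")
    return bool(region) and zone.startswith(region) and len(zone) == len(region) + 1


if __name__ == "__main__":
    missing = missing_vars(os.environ, REQUIRED)
    if missing:
        print("missing required variables:", ", ".join(missing))
    elif not zone_matches_region(os.environ):
        print("NCLUSTER_ZONE does not look like a zone in AWS_DEFAULT_REGION")
    else:
        print("environment looks complete")
    for name in missing_vars(os.environ, OPTIONAL):
        print("optional variable not set:", name)
```

Running it in the same shell as the exports prints which variables, if any, still need to be set.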
To check that things work with a pair of small machines, run the test below. This can take up to 10 minutes when run for the first time on an account, since infrastructure is created.
python mpi_test.py
This allocates 2 c5.large machines, sets up MPI between them, and runs hostname. You should see something like this when it works:
rr> ip-172-31-10-25 -1 mpi_test.py --role=worker
rr> ip-172-31-3-70 -1 mpi_test.py --role=worker
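If you capture that output, a small script can confirm that both workers reported in. This is only a sketch: it assumes the hostname is the token right after the "rr>" prefix, as in the sample lines above, and the function names are hypothetical.

```python
def worker_hosts(output):
    """Collect hostnames from lines shaped like 'rr> <host> ...'."""
    hosts = []
    for line in output.splitlines():
        fields = line.split()
        # Assumption: the hostname is the token right after the 'rr>' prefix.
        if len(fields) >= 2 and fields[0] == "rr>":
            hosts.append(fields[1])
    return hosts


def all_workers_reported(output, expected=2):
    """True when `expected` distinct hosts appear in the output."""
    return len(set(worker_hosts(output))) == expected


sample = """\
rr> ip-172-31-10-25 -1 mpi_test.py --role=worker
rr> ip-172-31-3-70 -1 mpi_test.py --role=worker
"""
print(worker_hosts(sample))          # ['ip-172-31-10-25', 'ip-172-31-3-70']
print(all_workers_reported(sample))  # True
```

Seeing the same hostname twice would suggest that both ranks landed on one machine, which this check reports as a failure.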
To run nccl-test on p3dn instances, do this:
python nccl_bench.py --num_tasks=2 --name=efatest
This launches machines named 0.efatest and 1.efatest.
To connect to 0.efatest and see logs:
ncluster connect 0.efatest
or
ssh ec2-user@<ip of 0.efatest> -t tmux a
This test runs on an image prepared using the prepare_efa_image.py script. Machines stay up indefinitely; kill them using "ncluster kill efatest" or through the AWS EC2 console.
Same as above, but use the following:
python pytorch_bench.py --role=launcher --num_tasks=2 --mpirun=1 --do_efa=1
aws configure
pip install -r requirements.txt
<run benchmark>
Some benchmarks print results on the console; for others, you need to SSH into the machine and look at "sudo nload" to see network usage.
This builds the latest NCCL and nccl-examples and runs the allreduce benchmark.
For EFA test
export NCLUSTER_ZONE=us-east-1b
python nccl_multiversion.py --instance_type=p3dn.24xlarge --name=nccl-efa --image_name='dlami23-efa'
For Ethernet test
python nccl_multiversion.py --instance_type=p3.16xlarge --name=nccl-ethernet --image_name='Deep Learning AMI (Ubuntu) Version 22.0'
Current: EFA=1.35 Gbps, Ethernet= with 16 GPUs over 2 nodes pre-patch
Issues:
python iperf_two_machines.py
# then ssh into machine and run `sudo nload`, hit Right to see load on ens5
Current: 91-93 Gbps with 8 processes/10 connections each
python pytorch_bench.py --role=launcher
Issues:
Current:
- using NCCL 2.3.7: 22.7 Gbps
- using NCCL 2.4+: 9.4 Gbps
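To put these numbers in context, here is a back-of-the-envelope sketch of how long one gradient exchange for a 300M parameter model would take at each measured rate. It assumes fp32 gradients (4 bytes per parameter), treats the exchange as a single one-way transfer (a real allreduce moves more data than that), and assumes the iperf and PyTorch figures come from comparable machines, which the results above do not confirm.

```python
PARAMS = 300e6          # parameter count quoted for the model in the introduction
BYTES_PER_PARAM = 4     # assumption: fp32 gradients


def gradient_gigabits(params=PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Size of one full set of gradients in gigabits (1 Gb = 1e9 bits)."""
    return params * bytes_per_param * 8 / 1e9


def transfer_seconds(gigabits, gbps):
    """Time to push `gigabits` through a link sustaining `gbps`."""
    return gigabits / gbps


# Rates copied from the "Current" results in this document.
measured = {
    "iperf (low end of 91-93 Gbps)": 91.0,
    "PyTorch, NCCL 2.3.7": 22.7,
    "PyTorch, NCCL 2.4+": 9.4,
}

size = gradient_gigabits()
for name, gbps in measured.items():
    print(f"{name}: {transfer_seconds(size, gbps):.2f} s per gradient exchange, "
          f"{gbps / 91.0:.0%} of the iperf figure")
```

Under these assumptions the gradients come to 9.6 gigabits, so the exchange takes roughly 0.11 s at the iperf rate, 0.42 s with NCCL 2.3.7, and 1.02 s with NCCL 2.4+.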
python ray_two_machines_bench.py
Current: 45.5 Gbps
Issues: