This repository contains the code used for paper:
This code was originally forked from the End to end ImageNet training.
- You can download ImageNet from Kaggle.
- You can download from our S3 bucket (s3://signum-majority-vote/dataset/ILSVRC.tar) (reproduction purpose only).
- Put the folder 'main/bit2byte-extension' to the directory of the PyTorch source code
- Execute this command on the directory of 'bit2byte-extension'
python setup.py install
You can find more information about C++ extension in PyTorch documentation
Note: You have to execute following commands in each instance.
Execute following commands on the directory of 'main'
ulimit -n 1000000
sudo /home/ubuntu/anaconda3/envs/fastai/bin/python3 -m torch.distributed.launch \ --nproc_per_node=1 --nnodes=[number of instances] --node_rank=[rank of instance] --master_addr="0.0.0.0" \ --master_port=1235 benchmark_main.py ~/ILSVRC/Data/CLS-LOC -a resnet50 -b 128 --lr 0.0001 \ --epochs 80 --save-dir ./ --world-size [number of instances] --print-freq 200 --compress --dist_backend gloo --weight-decay 1e-4 --momentum 0.9 --warm-up \ --dist-url [parameter sever's url]
ulimit -n 1000000
sudo /home/ubuntu/anaconda3/envs/fastai/bin/python3 -m torch.distributed.launch \ --nproc_per_node=1 --nnodes=[number of instances] --node_rank=[rank of instance] --master_addr="0.0.0.0" \ --master_port=1235 benchmark_main.py ~/ILSVRC/Data/CLS-LOC -a resnet50 -b 128 --lr 0.1 \ --epochs 80 --save-dir ./ --world-size [number of instances] --print-freq 200 --all_reduce --dist_backend nccl --weight-decay 0.1 --momentum 0.9 --warm-up \ --dist-url [parameter sever's url]
Execute following commands on the directory of 'benchmark/QRNN'
/home/ubuntu/anaconda3/envs/qrnn/bin/python3 -u -m torch.distributed.launch \ --nproc_per_node=1 --nnodes=[number of instances] --node_rank=[rank of instance] --master_addr="0.0.0.0" \ --master_port=1235 main_signum.py --epochs 12 --nlayers 4 --emsize 400 --nhid 2500 --alpha 0 --beta 0 \ --dropoute 0 --dropouth 0.1 --dropouti 0.1 --dropout 0.1 --wdrop 0 --wdecay 0 --bptt 140 --batch_size 240 \ --optimizer signum --lr 1e-3 --momentum 0.5 --data data/wikitext-103 --save WT103.12hr.QRNN.pt --when 12 --model QRNN \ --world-size [number of instances] --dist-url [parameter sever's url] \ --save-dir ./ --distributed --multi_gpu --momentun_warm_up
/home/ubuntu/anaconda3/envs/qrnn/bin/python3 -u -m torch.distributed.launch \ --nproc_per_node=1 --nnodes=[number of instances] --node_rank=[rank of instance] --master_addr="0.0.0.0" \ --master_port=1235 main_signum.py --epochs 12 --nlayers 4 --emsize 400 --nhid 2500 --alpha 0 --beta 0 \ --dropoute 0 --dropouth 0.1 --dropouti 0.1 --dropout 0.1 --wdrop 0 --wdecay 0 --bptt 140 --batch_size 240 \ --optimizer adam --lr 1e-3 --momentum 0.5 --data data/wikitext-103 --save WT103.12hr.QRNN.pt --when 12 --model QRNN \ --world-size [number of instances] --dist-url [parameter sever's url] \ --save-dir ./ --distributed --multi_gpu --momentun_warm_up
Execute following commands on the directory of 'benchmark/QSGD'
ulimit -n 1000000
sudo /home/ubuntu/anaconda3/envs/fastai/bin/python3 -m torch.distributed.launch \ --nproc_per_node=1 --nnodes=[number of instances] --node_rank=[rank of instance] --master_addr="0.0.0.0" \ --master_port=1235 benchmark_main.py ~/ILSVRC/Data/CLS-LOC -a resnet50 -b 128 --lr 1e-4 --seed 1 \ --epochs 90 --save-dir ./ --world-size [number of instances] --print-freq 50 \ --extra_epochs 0 --compress --signum --communication_method Signum \ --dist-url [parameter sever's url]
ulimit -n 1000000
sudo /home/ubuntu/anaconda3/envs/fastai/bin/python3 -m torch.distributed.launch \ --nproc_per_node=1 --nnodes=[number of instances] --node_rank=[rank of instance] --master_addr="0.0.0.0" \ --master_port=1235 benchmark_main.py ~/ILSVRC/Data/CLS-LOC -a resnet50 -b 128 --lr 0.1 --seed 1 \ --epochs 90 --save-dir ./ --world-size [number of instances] --print-freq 50 \ --extra_epochs 0 --compress --signum --communication_method QSGD --qsgd_level [the level of QSGD] [--enable_max, if enable max_norm] --all_reduce \ --dist-url [parameter sever's url]
Execute following commands on the directory of 'benchmark/Krum'
ulimit -n 1000000
sudo /home/ubuntu/anaconda3/envs/fastai/bin/python3 -m torch.distributed.launch \ --nproc_per_node=1 --nnodes=[number of instances] --node_rank=[rank of instance] --master_addr="0.0.0.0" \ --master_port=1235 benchmark_main.py ~/ILSVRC/Data/CLS-LOC -a resnet50 -b 128 --lr 1e-3 \ --epochs 90 --save-dir ./ --world-size 7 --print-freq 50 \ --extra_epochs 0 --compress --signum --communication_method Signum \ --enable_adversary --adversary_num [the number of adversaries] [--enable_minus_adversary, enable minus adversary or it will be random one] \ --dist-url [parameter sever's url]
ulimit -n 1000000
sudo /home/ubuntu/anaconda3/envs/fastai/bin/python3 -m torch.distributed.launch \ --nproc_per_node=1 --nnodes=[number of instances] --node_rank=[rank of instance] --master_addr="0.0.0.0" \ --master_port=1235 benchmark_main.py ~/ILSVRC/Data/CLS-LOC -a resnet50 -b 128 --lr 1e-1 \ --epochs 90 --save-dir ./ --world-size 7 --print-freq 50 \ --extra_epochs 0 --compress --signum --communication_method Signum \ --enable_krum --krum_f [the number of F] --enable_adversary --adversary_num [the number of adversaries] \ --dist-url [parameter sever's url]