The official code for the paper "Decentralized Asynchronous Non-convex Stochastic Optimization on Directed Graphs" [Accepted to IEEE Transactions on Control of Network Systems, Oct. 2022].
First, run the train script. Once the train script has finished successfully, run the test script.
- Training script: Here is an example training script for CIFAR-10 with 2 nodes on CPU:
```bash
#!/bin/bash
#SBATCH --time=70:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=80G
#SBATCH --partition=cpulong
#SBATCH --job-name=asysonatampi2nodecifar
#SBATCH --err=TRAIN_cifar10_2nodes.err
#SBATCH --out=TRAIN_cifar10_2nodes.out

# Load the required modules (RCI cluster module names).
ml TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
ml matplotlib/3.1.1-fosscuda-2019b-Python-3.7.4
ml SciPy-bundle/2019.10-fosscuda-2019b-Python-3.7.4

# Create the output directory if it does not exist.
dir='../../../save_async_step_reduce/cifar/0.7/2'
if [ ! -e "$dir" ]; then
    mkdir -p "$dir"
fi

# Launch 2 MPI processes (one per node) and log the output.
mpiexec -np 2 python ../../../src/main.py --iter 35000 \
    --step 0.2 \
    --eps 0.0001 \
    --model resnet_18 \
    --dataset cifar \
    --batch_size 128 \
    --connectivity 0.7 \
    --snapshot 30 \
    --seed 0 \
    --gpu 12 \
    --lr_schedule step_reduce \
    --lr_reduce_rate 10 \
    --lr_iter_reduce 10000 \
    --save_path ../../../save_async_step_reduce/ \
    2>&1 | tee "$dir/train_logs.txt"
```
| Parameter | Description |
|---|---|
| `model` | The model architecture. Options: `resnet_18`, `cnn`, `lenet`. Default = `cnn`. |
| `dataset` | Dataset to use. Options: `mnist`, `cifar`, `fashion_mnist`. Default = `mnist`. |
| `step` | The initial learning rate (step size). Default = `0.1`. |
| `lr_schedule` | Learning rate reduction schedule. Options: `diminishing`, `step_reduce`. Default = `diminishing`. |
| `batch_size` | Batch size. Default = `100`. |
| `iter` | Number of local iterations each node performs. Default = `1000`. |
| `eps` | Epsilon for the diminishing learning rate schedule. Default = `0.0001`. |
| `connectivity` | Graph connectivity percentage. Default = `0.5`. |
| `snapshot` | Snapshot interval in seconds. Default = `30`. |
| `len_shards` | Length of the saving shards. Default = `20`. |
| `lr_reduce_rate` | Learning rate reduction rate for the `step_reduce` schedule. Default = `10.0`. |
| `lr_iter_reduce` | Learning rate reduction interval (in iterations) for the `step_reduce` schedule. Default = `10000`. |
| `gpu` | Id of the GPU device; `-1` uses the CPU. Default = `0`. |
| `save_path` | Path to save snapshots. Default = `../save/`. |
| `num_outneighbor` | Number of out-neighbors. Default = `3`. |
| `seed` | Random seed for creating the graph and partitioning the dataset among nodes. Default = `0`. |
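For a quick sanity check outside SLURM, the same command can also be launched directly. The following is only a minimal sketch using the defaults from the table above; it assumes MPI is available, and the paths and the `save_local/<dataset>/<connectivity>/<nodes>` layout are illustrative (they mirror the pattern of the example script and should be adapted to your checkout):

```bash
# Sketch only: 2 MPI processes on CPU, MNIST with the default CNN.
# Paths and directory layout are illustrative, not prescribed by the repository.
dir='./save_local/mnist/0.5/2'
mkdir -p "$dir"
mpiexec -np 2 python src/main.py --iter 1000 \
    --step 0.1 \
    --model cnn \
    --dataset mnist \
    --batch_size 100 \
    --connectivity 0.5 \
    --snapshot 30 \
    --seed 0 \
    --gpu -1 \
    --save_path ./save_local/ \
    2>&1 | tee "$dir/train_logs.txt"
```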
- Test script: Here is an example test script for CIFAR-10 with 2 nodes on CPU:
```bash
#!/bin/bash
#SBATCH --time=70:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=60G
#SBATCH --partition=cpulong
#SBATCH --job-name=asysonatampi2nodecifar
#SBATCH --err=TEST_cifar10_2nodes.err
#SBATCH --out=TEST_cifar10_2nodes.out

# Load the required modules (RCI cluster module names).
ml TensorFlow/2.2.0-fosscuda-2019b-Python-3.7.4
ml matplotlib/3.1.1-fosscuda-2019b-Python-3.7.4
ml SciPy-bundle/2019.10-fosscuda-2019b-Python-3.7.4

# Evaluate the snapshots written by the corresponding training run and log the output.
dir='../../../save_async_step_reduce/cifar/0.7/2'
python ../../../src/test_agent.py --num_nodes 2 \
    --iter 35000 \
    --model resnet_18 \
    --dataset cifar \
    --batch_size 200 \
    --connectivity 0.7 \
    --snapshot 30 \
    --seed 0 \
    --len_shards 200 \
    --save_path ../../../save_async_step_reduce/ \
    2>&1 | tee "$dir/test_logs.txt"
```
| Parameter | Description |
|---|---|
| `model` | The model architecture. Options: `resnet_18`, `cnn`, `lenet`. Default = `cnn`. |
| `dataset` | Dataset to use. Options: `mnist`, `cifar`, `fashion_mnist`. Default = `mnist`. |
| `batch_size` | Batch size for testing. Default = `100`. |
| `iter` | Number of local iterations each node performs. Default = `1000`. |
| `connectivity` | Graph connectivity percentage. Default = `0.5`. |
| `snapshot` | Snapshot interval in seconds. Default = `30`. |
| `len_shards` | Length of the saving shards. Default = `20`. |
| `gpu` | Id of the GPU device; `-1` uses the CPU. Default = `0`. |
| `save_path` | Path to save snapshots. Default = `../save/`. |
| `num_nodes` | Number of nodes. Default = `10`. |
| `seed` | Random seed for creating the graph and partitioning the dataset among nodes. Default = `0`. |
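A matching local test invocation, again only a sketch, would pass the same `num_nodes`, dataset, connectivity, seed, and `save_path` used in the local training sketch above (the paths are illustrative):

```bash
# Sketch only: evaluate the snapshots produced by the local MNIST training sketch above.
# num_nodes, dataset, connectivity, seed, and save_path must match the training run.
dir='./save_local/mnist/0.5/2'
python src/test_agent.py --num_nodes 2 \
    --iter 1000 \
    --model cnn \
    --dataset mnist \
    --batch_size 100 \
    --connectivity 0.5 \
    --snapshot 30 \
    --seed 0 \
    --gpu -1 \
    --save_path ./save_local/ \
    2>&1 | tee "$dir/test_logs.txt"
```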
The code is organized as follows:
- `src`: the source code (contains all the code).
- `train_scripts`: all the training scripts, one for each dataset, connectivity, and number of nodes.
- `test_scripts`: all the test scripts, one for each dataset, connectivity, and number of nodes.
Please follow these steps to use the code:
- Training: First, run the train script for the corresponding dataset, connectivity, and number of nodes. Make sure the save paths are correct, i.e. check `dir='../../../save_async_step_reduce/cifar/0.7/2'` and `--save_path ../../../save_async_step_reduce/` in the train script. You can also modify the `#SBATCH ...` lines to use a GPU instead of the CPU: for a GPU, set `#SBATCH --partition=gpulong` and add `#SBATCH --gres=gpu:1`; for the CPU, you only need `#SBATCH --partition=cpulong` (see the header sketch after this list).
- Testing: After training has finished successfully, run the test script for the corresponding dataset, connectivity, and number of nodes. Again, make sure the save paths are correct, i.e. check that `dir='../../../save_async_step_reduce/cifar/0.7/2'` and `--save_path ../../../save_async_step_reduce/` in the test script match the values used in the corresponding train script.
- If you run the test script and get a "... not found" error, the train script has not finished successfully. When a train or test script has finished successfully, `Job Concluded` appears in its log file.
- After running the test script, the `save_path` directory contains a folder named `plot` with the results. The other folders in that directory hold the raw saved snapshots of the individual nodes.
- Please note that for CIFAR-10 the snapshots are huge and may require more than 4 TB of space; if there is not enough space, the runs will get stuck after several days. To be safe, run the test scripts on `cpuextralong`; they can take about a week, especially for 8 nodes and more.
- To reduce the size of the saved snapshots, increase the snapshot interval when running the training scripts by setting the `snapshot` argument to 90 seconds or more.
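For reference, here is a sketch of what the GPU variant of the training-script header could look like. The partition and `--gres` values are those mentioned above; the remaining lines are copied from the CPU example and may need adjusting to your cluster:

```bash
#!/bin/bash
# Sketch of a GPU job header for the RCI partitions named in this README.
# Only the partition and the gres line differ from the CPU training script above.
#SBATCH --time=70:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=80G
#SBATCH --partition=gpulong
#SBATCH --gres=gpu:1
#SBATCH --job-name=asysonatampi2nodecifar
#SBATCH --err=TRAIN_cifar10_2nodes.err
#SBATCH --out=TRAIN_cifar10_2nodes.out
# The rest of the training script (module loads, output directory, mpiexec call)
# stays unchanged; pass --gpu 0 (or another device id) to main.py to select the
# GPU, per the parameter table above.
```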
Please cite our work if you find this repository and the paper useful.
```bibtex
@article{kungurtsev2021decentralized,
  title={Decentralized Asynchronous Non-convex Stochastic Optimization on Directed Graphs},
  author={Kungurtsev, Vyacheslav and Morafah, Mahdi and Javidi, Tara and Scutari, Gesualdo},
  journal={arXiv preprint arXiv:2110.10406},
  year={2021}
}
```
We used the RCI clusters to run the experiments, and the `#SBATCH` parameters in the scripts are set for the RCI clusters. Access to the computational infrastructure of the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765 “Research Center for Informatics” is gratefully acknowledged.