Theano-MPI is a distributed framework for training deep learning models built in Theano, based on data-parallelism. The data-parallelism is implemented in two ways: Bulk Synchronous Parallel (BSP) and Elastic Averaging SGD (EASGD). This project is an extension of theano_alexnet, aiming to scale the training framework up to more than 8 GPUs and across nodes. Please see this technical report for an overview of the implementation details.
It is compatible with training models built in other framework libraries, e.g., Lasagne, Keras and Blocks, as long as the model parameters can be exposed as Theano shared variables. See lib/base/models/ for details. Alternatively, you can build your own models from scratch using basic Theano tensor operations and expose the model parameters as Theano shared variables; see the wiki for a tutorial on building customized neural networks.
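As a rough illustration of what "exposing parameters as Theano shared variables" means, here is a minimal hand-built model whose weights live in Theano shared variables collected in a `params` list. The class and attribute names are only an assumption for this sketch; the exact interface Theano-MPI expects is defined by the model classes under lib/base/models/.

```python
import numpy as np
import theano
import theano.tensor as T

class SimpleMLP(object):
    """Illustrative two-layer model; not part of the Theano-MPI code base."""
    def __init__(self, n_in=784, n_hidden=256, n_out=10, seed=1):
        rng = np.random.RandomState(seed)
        # Parameters are Theano shared variables, so a worker process can
        # read and overwrite their values during parameter exchange.
        self.W1 = theano.shared(0.01 * rng.randn(n_in, n_hidden).astype('float32'), name='W1')
        self.b1 = theano.shared(np.zeros(n_hidden, dtype='float32'), name='b1')
        self.W2 = theano.shared(0.01 * rng.randn(n_hidden, n_out).astype('float32'), name='W2')
        self.b2 = theano.shared(np.zeros(n_out, dtype='float32'), name='b2')
        self.params = [self.W1, self.b1, self.W2, self.b2]  # exposed to the framework

        # Symbolic forward pass and cost, built from basic tensor ops.
        self.x = T.matrix('x')
        self.y = T.ivector('y')
        h = T.tanh(T.dot(self.x, self.W1) + self.b1)
        p_y = T.nnet.softmax(T.dot(h, self.W2) + self.b2)
        self.cost = -T.mean(T.log(p_y)[T.arange(self.y.shape[0]), self.y])
```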
- OpenMPI 1.8.7 or an MPI-2 compliant equivalent.
- mpi4py
- numpy
- Theano
- Pylearn2
- PyCUDA
- zeromq
- hickle
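Before launching a training session it can be useful to check that MPI and mpi4py are set up correctly across the nodes you plan to use. The following sanity check is not part of the repository; it just prints one line per rank.

```python
# check_mpi.py -- run with: mpirun -n 4 python check_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank %d of %d running on %s" % (comm.rank, comm.size, MPI.Get_processor_name()))
```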
To run on the copper cluster:
- ssh copper.sharcnet.ca
- ssh to one computing node, e.g., cop3
- set .theanorc to the following:
[global]
mode = FAST_RUN
floatX = float32
base_compiledir = /home/USERNAME/.theano
[cuda]
root=/opt/sharcnet/cuda/7.0.28/toolkit
- cd into run/ and configure each section in config.yaml. Also configure the yaml file corresponding to the chosen model, e.g., alexnet.yaml, googlenet.yaml, vggnet.yaml or customized.yaml.
To start a BSP training session:
- In the "weight exchange" section in config.yaml, choose as follows:
sync_rule: BSP
- Choose a parameter exchanging strategy from "ar", "asa32", "asa16" and "copper": "ar" uses Allreduce() from mpi4py (see the sketch after this list), "asa32" and "asa16" use the Alltoall-sum-Allgather strategy with float32 and float16 respectively, and "copper" uses the binary reduction strategy designed for the copper GPU topology.
- Execute "./run_bsp_workers.sh N", where N is the desired number of workers.
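For intuition, the "ar" strategy amounts to an MPI Allreduce over each worker's gradient (or update) buffer, after which every worker applies the same averaged update; the "asa" variants instead split the buffer with Alltoall, sum locally and Allgather the result. The following mpi4py sketch is schematic only, under that assumption, and is not the framework's actual exchanger code.

```python
# ar_sketch.py -- schematic of the "ar" (Allreduce) exchange.
# Run with: mpirun -n 2 python ar_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

local_grad = np.random.rand(1000).astype(np.float32)  # stand-in for a flattened gradient
summed = np.empty_like(local_grad)

# Every worker contributes its gradient and receives the element-wise sum.
comm.Allreduce(local_grad, summed, op=MPI.SUM)
avg_grad = summed / size  # each worker now applies the same averaged update
print("rank %d finished exchange" % comm.rank)
```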
To start an EASGD training session:
- If you want to start the server and workers in one communicator, configure config.yaml as follows:
sync_rule: EASGD
sync_start: True
avg_freq: 2 (or desired value)
- Check the example ./run_easgd_4w_sync_start.sh (or ./run_easgd_4w.sh if sync_start==False), decide how many workers you want to run and which hosts and GPUs to use for each worker and the server, and make your customized run.sh script accordingly.
- Execute your ./run.sh.
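For background, the exchange that EASGD performs periodically (presumably every avg_freq iterations, per the setting above) elastically pulls each worker's parameters toward the server's center variable and vice versa. The NumPy sketch below shows that update rule under the usual EASGD formulation, with alpha as the elastic coupling coefficient; it is a simplification for illustration, not the repository's server/worker code.

```python
import numpy as np

def easgd_exchange(worker_param, center_param, alpha=0.5):
    """One elastic-averaging exchange between a worker's parameter copy
    and the server's center variable (single-process sketch)."""
    diff = alpha * (worker_param - center_param)
    worker_param -= diff   # worker is pulled toward the center
    center_param += diff   # center is pulled toward the worker
    return worker_param, center_param

# Between exchanges each worker runs its own SGD steps on its shard of the
# data; sync_start only controls whether the server and workers are
# launched in one MPI communicator.
```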
Preprocessed data (1000 categories, batch size 128) is located at /work/mahe6562/prepdata/.
Make sure you have access to the data.
For the best running speed, the memory cache may need to be cleared before running.
To get deterministic and reproducible results, turn off all randomness in the 'random' section of the config and use cuda-convnet from Pylearn2 instead of the non-deterministic dnn.conv and dnn.pool from cuDNN.
### BSP time per 5120 images in seconds [allow_gc = True]
Model | 1GPU | 2GPU | 4GPU | 8GPU | 16GPU |
---|---|---|---|---|---|
AlexNet-128b | 31.20 | 15.65 | 7.78 | 3.90 | |
GoogLeNet-32b | 134.90 | 67.38 | 33.60 | 16.81 | |
VGGNet-32b | 410.3 | 216.0 | 113.8 | 64.7 | 38.5 |
See the wiki for more details.