ddl-benchmarks: Benchmarks for Distributed Deep Learning

Introduction

This repository contains a set of benchmarking scripts for evaluating the training performance of popular distributed deep learning methods studied in our paper, which focuses on system-level optimization algorithms for synchronized stochastic gradient descent (S-SGD) with data parallelism; a minimal S-SGD sketch follows the list below. Currently, it covers:

system architectures

optimization algorithms

  • Wait-free backpropagation (WFBP), also known as pipelining the backward computations with gradient communications; it is a default feature in current distributed deep learning frameworks.
  • Tensor fusion, which has been integrated in Horovod with a hand-crafted threshold that determines when to fuse tensors; MG-WFBP instead decides when to fuse tensors dynamically.
  • Tensor partitioning and priority scheduling, as proposed in ByteScheduler.
  • Gradient compression with quantization (e.g., signSGD) and sparsification (e.g., TopK-SGD). These methods are included in the code but excluded from our paper, which focuses on system-level optimization methods (a minimal Top-K sketch follows this list).

deep neural networks
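
As context for the list above, here is a minimal sketch of synchronous data-parallel S-SGD written with Horovod's PyTorch API. The model and data are placeholders rather than this repository's benchmark code; the point is the structure: gradients are averaged across workers by allreduce, so every worker applies an identical update.

import torch
import torch.nn as nn
import horovod.torch as hvd

hvd.init()                                # one process per worker
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

model = nn.Linear(1024, 10)               # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# Wrap the optimizer: gradients are averaged across workers via allreduce.
opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())
# Start all workers from identical parameters.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

criterion = nn.CrossEntropyLoss()
for step in range(10):
    x = torch.randn(64, 1024)             # placeholder batch
    y = torch.randint(0, 10, (64,))
    opt.zero_grad()
    criterion(model(x), y).backward()     # backward overlaps with communication (WFBP)
    opt.step()                            # identical S-SGD update on every worker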
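
Likewise, the sparsification mentioned above can be illustrated with a minimal Top-K sparsifier: only the k largest-magnitude gradient entries are communicated, and the receiver rebuilds a dense tensor. The function names here are hypothetical, not this repository's API.

import torch

def topk_sparsify(grad, ratio=0.01):
    # Keep only the top ratio*numel gradient entries by magnitude.
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, idx = torch.topk(flat.abs(), k)
    return flat[idx], idx                 # only these values/indices are sent

def desparsify(values, indices, numel, shape):
    # Rebuild a dense gradient: zeros everywhere except the kept entries.
    out = torch.zeros(numel, dtype=values.dtype, device=values.device)
    out[indices] = values
    return out.view(shape)

g = torch.randn(4, 256)
vals, idx = topk_sparsify(g, ratio=0.05)
g_hat = desparsify(vals, idx, g.numel(), g.shape)  # dense approximation of g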

Installation

Prerequisites

Get the code

$ git clone https://github.com/HKBU-HPML/ddl-benchmarks.git
$ cd ddl-benchmarks
$ pip install -r requirements.txt

Configure the cluster settings

Before running the scripts, carefully edit the configuration files in the configs directory.

  • configs/cluster*: configure the host files for MPI (a generic example follows below).
  • configs/envs.conf: configure the cluster environments.
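
For reference, an MPI hostfile typically lists one node per line with a slot count, i.e., the number of processes to launch on that node (usually one per GPU). The hostnames below are placeholders; follow the exact format used by the existing files under configs/:

gpu-node-01 slots=4
gpu-node-02 slots=4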

Create a log folder, e.g.,

$ mkdir -p logs/pcie

Run benchmarks

  • The batch mode:
$ python benchmarks.py
  • The individual mode, e.g.,
$ cd horovod
$ dnn=resnet50 bs=64 nworkers=64 ./horovod_mpi_cj.sh
Here dnn selects the model, bs the batch size, and nworkers the number of workers.

Paper

If you use this repository in your research, please cite our paper:

@article{shi2020ddlsurvey,
    author = {Shi, Shaohuai and Tang, Zhenheng and Chu, Xiaowen and Liu, Chengjian and Wang, Wei and Li, Bo},
    title = {Communication-Efficient Distributed Deep Learning: Survey, Evaluation, and Challenges},
    journal = {arXiv preprint arXiv:2005.13247},
    url = {https://arxiv.org/pdf/2005.13247.pdf},
    year = {2020}
}