ASTRA-Sim

What is this repository for?

This is the ASTRA-sim distributed Deep Learning Training simulator, developed in collaboration between Georgia Tech, Facebook and Intel.

An overview is presented here: [figure: ASTRA-sim overview]

The full description of the tool and its strengths can be found in the paper below:

Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna, "ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms," in Proc. of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr. 2020 [pdf][slides][video]

Bibtex

@inproceedings{astrasim,
    author       = {Saeed Rashidi and
                   Srinivas Sridharan and
                   Sudarshan Srinivasan and
                   Tushar Krishna},
    title        = {{ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms}},
    booktitle    = {{IEEE} International Symposium on Performance Analysis of Systems
                   and Software, {ISPASS} 2020, Boston, MA, USA, August 22-26, 2020},
    publisher    = {{IEEE}},
    year         = {2020},
}

Setup Instructions

# Clone the repository
$ git clone https://github.com/astra-sim/astra-sim.git

# Clone the submodules
$ cd astra-sim
$ git submodule init
$ git submodule update

Instructions for compiling & running Garnet2.0 as the network simulator

  1. Run "./build/astra_garnet/build.sh -c" to compile and integrate astra-sim with gem5 (the -l flag cleans the compilation instead). This creates a binary in which garnet is integrated with astra-sim. The garnet (gem5) backend is hosted at https://github.com/georgia-tech-synergy-lab/gem5_astra .
  2. Run an example inside the "examples/" directory with garnet as the backend. Example: "examples/run_allreduce.sh -n garnet". This command runs a single all-reduce collective on a Torus topology.
  3. The results of the example scripts are written to the "examples/results/" directory.

Instructions for compiling & running analytical backend as the network simulator

  1. Run "./build/astra_analytical/build.sh -c" to compile and integrate astra-sim with the analytical backend (the -l flag cleans the compilation instead). This creates a binary in which the analytical backend is integrated with astra-sim. The analytical backend is hosted at https://github.com/astra-sim/analytical .
  2. Run an example inside the "examples/" directory with the analytical backend. Example: "examples/run_allreduce.sh -n analytical". This command runs a single all-reduce collective on a Torus topology.
  3. The results of the example scripts are written to the "examples/results/" directory.
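The build and run steps for the two backends follow the same pattern, so they can be sketched as one small helper. This is only an illustration: the function name astra_steps is hypothetical, and it prints the commands from the steps above rather than executing them, so nothing is built unless you pipe the output to a shell from the repository root.

```shell
# Print (do not execute) the build and example commands for one backend.
# astra_steps is a hypothetical helper name, not part of the repository.
# Usage: astra_steps garnet|analytical
astra_steps() {
  case "$1" in
    garnet|analytical) ;;
    *) echo "unknown backend: $1" >&2; return 1 ;;
  esac
  # Step 1: compile and integrate astra-sim with the chosen backend.
  printf './build/astra_%s/build.sh -c\n' "$1"
  # Step 2: run the single all-reduce example on that backend.
  printf 'examples/run_allreduce.sh -n %s\n' "$1"
}

astra_steps analytical
```

To actually run the steps, something like "astra_steps analytical | sh" from the repository root should work, assuming the submodules have been cloned.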

Instructions for compiling & running NS3 as the network simulator

Coming Soon!

NOTE: Regardless of the backend, the delays reported on screen at the end of the simulation are in cycles, while the delays in the csv files are in microseconds.
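To compare an on-screen delay with a csv value, the two units have to be reconciled. The sketch below assumes the usual relation between cycles and time (cycles divided by the clock rate in MHz gives microseconds). The clock frequency is not stated in this README, so the 1000 MHz value and the helper name cycles_to_us are purely illustrative; check the network and system input files for the value that applies to your run.

```shell
# Convert a cycle count to microseconds for a given clock rate in MHz.
# A clock of N MHz completes N cycles per microsecond, so:
#   microseconds = cycles / MHz
# cycles_to_us is a hypothetical helper, not part of ASTRA-sim.
cycles_to_us() {
  LC_ALL=C awk -v cycles="$1" -v mhz="$2" \
    'BEGIN { printf "%.3f\n", cycles / mhz }'
}

# Hypothetical example: 250000 cycles at an assumed 1000 MHz (1 GHz) clock.
cycles_to_us 250000 1000   # prints 250.000
```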

ASTRA-SIM Binary Command Line Options

Regardless of the backend, the following options may be passed to the binary (see the example scripts):

--network-configuration (required): The network input file dir.

--system-configuration (required): The system input file dir.

--workload-configuration (required): The workload input file dir.

--path (required): The path to dump the results.

--run-name (required): Name of the current run.

--num-passes (required): Number of training passes to simulate.

--total-stat-rows (required): Total number of runs that write to the same csv file (please see run_multi.sh inside the "examples/" directory). This is useful when multiple runs write to the same csv file. This value should be 1 if only one run is executed.

--stat-row (required): The position (row) at which this run writes its stats in the csv stat files (please see run_multi.sh inside the "examples/" directory). This is useful when multiple runs write to the same csv file. This value should be 0 if only one run is executed.

--compute-scale (optional): Scales all compute times (reported in the workload input file) by this factor. The default value is 1.

--comm-scale (optional): Scales all communication sizes (reported in the workload input file) by this factor. The default value is 1.
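As a rough illustration of how the required options fit together for a single run, the sketch below assembles one command line and prints it. Several things here are assumptions: the binary path (ASTRA_BIN), the helper name astra_cmd, the input file names, and the "--option=value" syntax are all placeholders, so consult the scripts in "examples/" for the exact form each backend expects.

```shell
# Hypothetical binary path; the real name depends on the backend build.
ASTRA_BIN="${ASTRA_BIN:-./astra_sim_binary}"

# Print a single-run command line using the required options listed above.
# Args: network input, system input, workload input, result path,
#       run name, number of passes.
# For a single run, --total-stat-rows is 1 and --stat-row is 0.
astra_cmd() {
  printf '%s --network-configuration=%s --system-configuration=%s --workload-configuration=%s --path=%s --run-name=%s --num-passes=%s --total-stat-rows=1 --stat-row=0\n' \
    "$ASTRA_BIN" "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder input names (net_input, sys_input, wl_input) are hypothetical.
astra_cmd net_input sys_input wl_input examples/results/ my_run 2
```

The optional --compute-scale and --comm-scale flags are left out here, which corresponds to their default value of 1.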

NOTE: The garnet+astra-sim binary also allows all of the network input options to be overridden by command-line options.
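When several runs share one csv file, --total-stat-rows and --stat-row have to be set consistently across them. The following sketch shows one way that bookkeeping might look: each run gets the same total and its own zero-based row. It is not taken from run_multi.sh; the helper name multi_run_cmds, the binary path, the workload names, and the "--option=value" syntax are assumptions, and the other required options are omitted for brevity.

```shell
# Hypothetical binary path; the real name depends on the backend build.
ASTRA_BIN="${ASTRA_BIN:-./astra_sim_binary}"

# Print one command per workload. All runs report the same total number
# of rows, and each run is assigned its own row index starting at 0.
# Other required options are omitted here for brevity.
multi_run_cmds() {
  total=$#
  row=0
  for workload in "$@"; do
    printf '%s --workload-configuration=%s --total-stat-rows=%d --stat-row=%d\n' \
      "$ASTRA_BIN" "$workload" "$total" "$row"
    row=$((row + 1))
  done
}

# Three hypothetical workloads share one csv file: rows 0, 1 and 2 of 3.
multi_run_cmds wl_a wl_b wl_c
```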

Input Files to ASTRA-sim

  • Workload: inputs/workload/
    • see inputs/workload/README.md
    • see scripts/workload_generator/README.md for instructions on how to use an automated script to generate workload input files.
  • System: inputs/system/
    • see inputs/system/README.md
  • Network:
    • inputs/network/garnet (for garnet backend inputs)
      • see inputs/network/garnet/README.md
    • inputs/network/analytical (for analytical backend inputs)
      • see inputs/network/analytical/README.md

Contact

Please email Saeed Rashidi (saeed.rashidi@gatech.edu) or Srinivas Sridharan (ssrinivas@fb.com) or Tushar Krishna (tushar@ece.gatech.edu) if you have any questions.

Core Developers

  • Saeed Rashidi (Georgia Tech)
  • Srinivas Sridharan (Facebook)

Additional Contributors

  • Jiayi Huang (University of California, Santa Barbara)
  • Apurve Chawde (Georgia Tech)
  • Santosh Kumar Elangoven (Georgia Tech)
  • William Won (Georgia Tech)
  • Tushar Krishna (Georgia Tech)
  • Greg Steinbrecher (Facebook)