ASTRA-Sim
What is this repository for?
This is the ASTRA-sim distributed Deep Learning Training simulator, developed in collaboration between Georgia Tech, Facebook and Intel.
An overview is presented here:
The full description of the tool and its strength can be found in the paper below:
Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna, "ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms" In Proc of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr 2020 [pdf][slides][video]
Bibtex
@inproceedings{astrasim,
author = {Saeed Rashidi and
Srinivas Sridharan and
Sudarshan Srinivasan and
Tushar Krishna},
title = {{ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms}},
booktitle = {{IEEE} International Symposium on Performance Analysis of Systems
and Software, {ISPASS} 2020, Boston, MA, USA, August 22-26, 2020},
publisher = {{IEEE}},
year = {2020},
}
Setup Instructions
# Clone the repository
$ git clone https://github.com/astra-sim/astra-sim.git
# cloning the submodules
$ cd astra-sim
$ git submodule init
$ git submodule update
Instructions for compiling & running Garnet2.0 as the network simulator
- Run "./build/astra_garnet/build.sh -c" to compile and integrate astra-sim with gem5 (-l flag will clean the compilation). This will create a binary file where garnet is integrated with astra-sim. The analytical backend is hosted at https://github.com/georgia-tech-synergy-lab/gem5_astra .
- Run an example inside the "examples/" directory with garnet as a backend. Example: "examples/run_allreduce.sh -n garnet". This command will run a single all-reduce collective on a Torus topology.
- The results of example script runs will be dumped inside "examples/results/" path.
Instructions for compiling & running analytical backend as the network simulator
- Run "./build/astra_analytical/build.sh -c" to compile and integrate astra-sim with gem5 (-l flag will clean the compilation). This will create a binary file where analytical backend is integrated with astra-sim. The analytical backend is hosted at https://github.com/astra-sim/analytical .
- Run an example inside the "examples/" directory with garnet as a backend. Example: "examples/run_allreduce.sh -n analytical". This command will run a single all-reduce collective on a Torus topology.
- The results of example script runs will be dumped inside "examples/results/" path.
Instructions for compiling & running NS3 as the network simulator
Coming Soon!
NOTE: The on-screen reported delays (no matter what backend is used) after the end of simulation are in cycles while the delays inside the csv files are in terms of microseconds.
ASTRA-SIM Binary Command Line Options
When running the binary file (no matter what backend is used), the following options may be passed to the binary file (see example scripts):
--network-configuration (required): The network input file dir.
--system-configuration (required): The system input file dir.
--workload-configuration (required): The workload input file dir.
--path (required): The path to dump the results.
--run-name (required): Name of the current run.
--num-passes (required): Number of training passes to simulate.
--total-stat-rows (required): Total number of runs that want to write to the same csv file (please see run_multi.sh inside the "examples/"" directory). This is useful when multiple runs want to write to the same csv file. This value should be 1 if only 1 run is executed.
--stat-row (required): The position of the run to write its stats into the csv stat files (please see run_multi.sh inside the "examples/"" directory). This is useful when multiple runs want to write to the same csv file. This value should be 0 if only 1 run is executed.
--compute-scale (optional): Scales the all compute times (reported in the workload input file) by this scale. Tge default value is 1.
--comm-scale (optional): Scales the all communication sizes (reported in the workload input file) by this scale. Tge default value is 1.
NOTE: The garnet+astra-sim binary also allows all of the network input options be overridden by the command line options.
Input Files to ASTRA-sim
- Workload: inputs/workload/
- see inputs/workload/README.md
- see scripts/workload_generator/README.md for instruction on how to use an automated script to generate workload input files.
- System: inputs/system/
- see inputs/system/README.md
- Network:
- inputs/network/garnet (for garnet backend inputs)
- see inputs/network/garnet/README.md
- inputs/network/analytical (for analytical backend inputs)
- see inputs/network/analytical/README.md
- inputs/network/garnet (for garnet backend inputs)
Contact
Please email Saeed Rashidi (saeed.rashidi@gatech.edu) or Srinivas Sridharan (ssrinivas@fb.com) or Tushar Krishna (tushar@ece.gatech.edu) if you have any questions.
Core Developers
- Saeed Rashidi (Georgia Tech)
- Srinivas Sridharan (Facebook)
Additional Contributors
- Jiayi Huang (University of California, Santa Barbara)
- Apurve Chawde (Georgia Tech)
- Santosh Kumar Elangoven (Georgia Tech)
- William Won (Georgia Tech)
- Tushar Krishna (Georgia Tech)
- Greg Steinbrecher (Facebook)