This module contains FlexNetPacket, a packet-level network simulator. It models a cluster running DNN training jobs on TopoOpt, Fat-Tree, SiP-ML, Expander, and abstract switch network topologies. The source code extends the Opera simulator from NSDI 2020; please check the original README file here.
To build the FlexNetPacket simulator, from the top level directory run:
export FF_HOME=<directory of FlexNet>
cd src/clos
make
cd datacenter
make
The executables are found in the src/clos/datacenter folder, and their names start with "htsim_". The following table provides details on each executable:
Executable | Network Topology |
---|---|
htsim_tcp_fattree | Fat-Tree network topology, single job |
htsim_tcp_flat | Flat (switchless) network topology, single job. Use this executable for TopoOpt and Expander |
htsim_tcp_fc | Fully connected topology, single job |
htsim_tcp_os_fattree | Oversubscribed Fat-Tree where the ToR switches are oversubscribed |
htsim_tcp_aggos_ft | Oversubscribed Fat-Tree where the aggregation layer is oversubscribed |
htsim_tcp_dyn_flat | Dynamic network used to simulate SiP-ML |
htsim_tcp_fattree_multijob | Fat-Tree network topology used to simulate multiple DNN jobs concurrently |
htsim_tcp_aggos_fattree_multijob | Oversubscribed (aggregation switches) Fat-Tree network topology used to simulate multiple DNN jobs concurrently |
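After building, it can be handy to confirm that every executable in the table was actually produced. The following is a minimal sketch, not part of the repository: the function name `check_htsim_builds` is hypothetical, and it assumes it is run from the top-level directory so that the default path src/clos/datacenter resolves. The executable names are taken from the table above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with FlexNetPacket): reports which of the
# executables listed in the table exist in a build directory.
# Usage: check_htsim_builds [dir]   (dir defaults to src/clos/datacenter)
check_htsim_builds() {
    dir="${1:-src/clos/datacenter}"
    missing=0
    for exe in htsim_tcp_fattree htsim_tcp_flat htsim_tcp_fc \
               htsim_tcp_os_fattree htsim_tcp_aggos_ft htsim_tcp_dyn_flat \
               htsim_tcp_fattree_multijob htsim_tcp_aggos_fattree_multijob; do
        if [ -x "$dir/$exe" ]; then
            echo "ok       $exe"
        else
            echo "missing  $exe"
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every listed executable was found.
    [ "$missing" -eq 0 ]
}
```

Calling `check_htsim_builds` with no arguments checks the default build folder; the function returns a non-zero status if any executable is missing, so it can be chained after `make` in a script.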
FlexNetPacket's major extension to the htsim simulator is the ability to take a taskgraph (in FlatBuffer format) generated by the FlexNet simulator. To achieve this, src/clos/ffapp.* was implemented as an API to read and process such taskgraphs. In addition, a few network topologies were added, notably the dynamic network executable that simulates SiP-ML. The network scheduler code can be found at src/clos/dyn_net_sch.*.
Each topology's "main" function can be found in src/clos/datacenter/main_tcp_*.cpp, which provides a detailed description of the input arguments for the executable.