/dml-network-traffic-analysis

Code for "Network Traffic Characteristics of Machine Learning Frameworks Under the Microscope" by Johannes Zerwas, Kaan Aykurt, Stefan Schmid and Andreas Blenk (CNSM 2021).

Primary LanguageJupyter NotebookGNU General Public License v3.0GPL-3.0

DML Network Traffic Measurements

This repository includes the source code for the measurements and the analysis scripts used for "Network Traffic Characteristics of Machine Learning Frameworks Under the Microscope" by Johannes Zerwas, Kaan Aykurt, Stefan Schmid and Andreas Blenk (2021).

The collected traces are available here.

Folders:

  • analysis/: parsing, aggregation and evaluation scripts
  • custombox: Vagrant VM files
  • frameworks/: Python automation of experiments. Scripts to run the DML trainings are in the sub-folders corresponding to each framework.

The remaining scripts and modules are shared for all the frameworks.

Requirements

On orchestrator/controller

  • Install necessary python packages on the orchestrator/controller machine:
pip install paramiko pandas numpy seaborn sklearn matplotlib statsmodels
  • Update config.py

On the worker machines:

  • Orchestrator must be able to SSH as root into worker nodes
  • Worker nodes' root user must have a keypair for SSH
  • Prepare the folder /root/dependencies with downloaded files under the following folder structre:
    • cuda-files (can be downloaded from NVIDIA Developer Program)
      • libcudnn7_7.6.5.32-1+cuda10.1_amd64.deb
      • libcudnn7-dev_7.6.5.32-1+cuda10.1_amd64.deb
      • libcudnn7-doc_7.6.5.32-1+cuda10.1_amd64.deb
    • golang (can be downladed from https://golang.org/)
      • go1.16.3.linux-amd64.tar.gz
  • Update custombox/custombox_ssh_install.sh with the SSH pub-key of the orchestrator/controller node

Running an Experiment

run_experiment.py script controls the whole VM creation and experiment running process. The script is run with following parameters:

  • framework: name of the framework
  • backend: name of the communication backend (only for naming convention, desired backend should be specified manually in custom box creation scripts)
  • models: models to be run, separated with commas
  • batchsizes: batch sizes of interest, separated with commas
  • topologies: topologies of interest, separated with commas (defaults to ring)
  • losses: losses of interest, separated with commas (defaults to 0)
  • usebox: flag to set intermediary custom box usage (currently only TensorFlow supports this, defaults to False)

Example:

python3 run_experiment.py --framework kungfu --backend kungfu --models mobilenetv2,densenet201 --batchsizes 64
--topologies BINARY_TREE_STAR,TREE,CLIQUE --losses 0,0.05,0.1

Notes:

  • Cuda and Golang dependencies should be in a file which should be passed through to the VM.
  • All VMs should be mounted a folder which the results are written into.
  • In case the nodes have different internet connection speeds, a dummy experiment has to be run so that each worker has a copy of the training dataset and no hanging occurs in the beginning.

Settings to run

Framework Backend Models Batch Sizes Topologies Losses
TensorFlow grpc, grpc_nccl mobilenetv2, densenet201, resnet50, resnet101 64, 128, 512 ring 0, 0.05, 0.1, 0.2, 0.5, 1, 2
Horovod mpi, mpi_nccl, gloo, gloo_nccl mobilenetv2, densenet201, resnet50, resnet101 64, 128, 512 ring 0
KungFu kungfu mobilenetv2, densenet201, resnet50, resnet101 64 BINARY_TREE_STAR, CLIQUE, STAR, TREE 0

Evaluating the Results

Raw traces are first parsed via parsing_script.py and then aggregated via aggregate.py under analysis/ folder. A Jupyter Notebook (analysis.ipynb) is used for the evaluation and visualization of the results. Folder names designate the experiment configuration in format framework-model-optimizer-batchsize-backend-delay-(topology).

  • Run parsing:
python parsing_script.py --path /path/to/datafoler
  • Run aggregation:
python aggregate.py --path /path/to/datafoler