Zeus

An energy optimization framework for DNN training.

Primary language: Python. License: Apache-2.0.

Zeus automatically optimizes the energy and time of training a DNN to a target validation metric by finding the optimal batch size and GPU power limit.

Please refer to our NSDI’23 publication for details. Check out the Overview for a summary.
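
As a rough illustration of what trading off energy against time can look like, the sketch below scores each (batch size, power limit) pair with a weighted sum of energy-to-accuracy and time-to-accuracy and picks the cheapest one. This is not Zeus's API: the `Measurement` class, the `cost` and `pick_best` functions, the `eta` weight, the 300 W maximum power, and all numbers are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical record of one training run that reached the target metric."""
    batch_size: int
    power_limit_w: int
    energy_j: float  # total GPU energy consumed, in joules
    time_s: float    # total training time, in seconds


def cost(m: Measurement, eta: float, max_power_w: float) -> float:
    """Weighted sum of energy and time.

    Multiplying time by max_power_w puts both terms in joules, so that
    eta=1 means "only energy matters" and eta=0 means "only time matters".
    """
    return eta * m.energy_j + (1.0 - eta) * max_power_w * m.time_s


def pick_best(measurements, eta=0.5, max_power_w=300.0):
    """Return the measurement with the lowest cost for the given weight."""
    return min(measurements, key=lambda m: cost(m, eta, max_power_w))


# Made-up numbers, for illustration only.
runs = [
    Measurement(batch_size=32, power_limit_w=300, energy_j=9.0e5, time_s=3000.0),
    Measurement(batch_size=64, power_limit_w=250, energy_j=7.5e5, time_s=3200.0),
    Measurement(batch_size=128, power_limit_w=200, energy_j=7.0e5, time_s=4000.0),
]

best = pick_best(runs, eta=0.5)
print(best.batch_size, best.power_limit_w)
```

Changing `eta` shifts the choice: a value near 1 favors the most energy-efficient configuration, while a value near 0 favors the fastest one.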

Zeus is part of The ML.ENERGY Initiative.

Repository Organization

.
├── zeus/                # ⚡ Zeus Python package
│   ├── run/             #    - Tools for running Zeus on real training jobs
│   ├── policy/          #    - Optimization policies and extension interfaces
│   ├── profile/         #    - Tools for profiling energy and time
│   ├── simulate.py      #    - Tools for trace-driven simulation
│   ├── util/            #    - Utility functions and classes
│   ├── analyze.py       #    - Analysis functions for power logs
│   ├── monitor.py       #    - Class for profiling energy inside training scripts
│   └── job.py           #    - Class for job specification
│
├── zeus_monitor/        # 🔌 GPU power monitor
│   ├── zemo/            #    - A header-only library for querying NVML
│   └── main.cpp         #    - Source code of the power monitor
│
├── examples/            # 🛠️ Examples of integrating Zeus
│   ├── capriccio/       #    - Integrating with Hugging Face and Capriccio
│   ├── cifar100/        #    - Integrating with torchvision and CIFAR100
│   └── trace_driven/    #    - Using the Zeus trace-driven simulator
│
├── capriccio/           # 🌊 A drifting sentiment analysis dataset
│
└── trace/               # 🗃️ Train and power traces for various GPUs and DNNs

Getting Started

Refer to Getting started for complete instructions on environment setup, installation, and integration.

Docker image

We provide a Docker image fully equipped with all dependencies and environments. The only command you need is:

docker run -it \
    --gpus 1                    `# Mount one GPU` \
    --cap-add SYS_ADMIN         `# Needed to change the power limit of the GPU` \
    --shm-size 64G              `# PyTorch DataLoader workers need enough shm` \
    symbioticlab/zeus:latest \
    bash

Refer to Environment setup for details.

Examples

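We provide working examples for integrating and running Zeus in the examples/ directory: capriccio/ (integration with Hugging Face and Capriccio), cifar100/ (integration with torchvision and CIFAR100), and trace_driven/ (using the Zeus trace-driven simulator).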

Extending Zeus

You can easily implement custom policies for batch size and power limit optimization and plug them into Zeus.

Refer to Extending Zeus for details.
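
To give a feel for what a custom batch size policy might do, here is a minimal, self-contained sketch of an explore-then-exploit strategy: try each candidate batch size once, then keep choosing the one with the lowest average observed cost. The class name `ExploreThenExploitPolicy` and its `next_batch_size` and `observe` methods are hypothetical and do not reflect the actual extension interfaces in `zeus/policy/`; consult Extending Zeus for the real ones.

```python
class ExploreThenExploitPolicy:
    """Hypothetical policy, not part of Zeus.

    Tries every candidate batch size once, then always returns the
    batch size with the lowest mean observed cost.
    """

    def __init__(self, batch_sizes):
        self.batch_sizes = list(batch_sizes)
        self.costs = {}  # batch size -> list of observed costs

    def next_batch_size(self):
        # Exploration: return the first batch size with no observation yet.
        for bs in self.batch_sizes:
            if bs not in self.costs:
                return bs
        # Exploitation: return the batch size with the lowest mean cost.
        return min(
            self.costs,
            key=lambda bs: sum(self.costs[bs]) / len(self.costs[bs]),
        )

    def observe(self, batch_size, cost):
        # Record the cost of a finished job that used this batch size.
        self.costs.setdefault(batch_size, []).append(cost)


# Simulated recurring jobs with made-up, fixed costs per batch size.
policy = ExploreThenExploitPolicy([32, 64, 128])
fake_costs = {32: 9.0, 64: 8.5, 128: 9.5}
chosen = []
for _ in range(5):
    bs = policy.next_batch_size()
    policy.observe(bs, fake_costs[bs])
    chosen.append(bs)
print(chosen)
```

A purely greedy rule like this can lock onto a batch size that looked good by chance, so a real policy would typically keep exploring to some degree.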

Citation

Please consider citing our NSDI’23 paper if you find Zeus relevant to your research.

@inproceedings{zeus-nsdi23,
    title     = {Zeus: Understanding and Optimizing {GPU} Energy Consumption of {DNN} Training},
    author    = {Jie You and Jae-Won Chung and Mosharaf Chowdhury},
    booktitle = {USENIX NSDI},
    year      = {2023}
}

Contact

Jae-Won Chung (jwnchung@umich.edu)