Zeus automatically optimizes the energy and time of training a DNN to a target validation metric by finding the optimal batch size and GPU power limit.
Please refer to our NSDI’23 publication for details. Check out Overview for a summary.
Zeus is part of The ML.ENERGY Initiative.
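To make the optimization target concrete, here is a minimal sketch of choosing a batch size and power limit from a table of profiled runs. It assumes the energy-time cost from the NSDI’23 paper, a weighted sum of energy-to-accuracy and time-to-accuracy scaled by the maximum power limit. The function names, the `eta` parameter name, and every measurement below are illustrative assumptions, not Zeus's API or real data.

```python
from typing import Dict, Tuple

# (batch size, power limit in W) -> (energy to accuracy in J, time to accuracy in s)
Profile = Dict[Tuple[int, int], Tuple[float, float]]


def energy_time_cost(energy_j: float, time_s: float, eta: float, max_power_w: float) -> float:
    """Weighted cost: eta=1 cares only about energy, eta=0 only about time.

    Time is multiplied by the maximum power limit so both terms are in joules.
    """
    return eta * energy_j + (1.0 - eta) * max_power_w * time_s


def best_config(profile: Profile, eta: float, max_power_w: float) -> Tuple[int, int]:
    """Return the (batch size, power limit) pair with the lowest cost."""
    return min(
        profile,
        key=lambda cfg: energy_time_cost(*profile[cfg], eta, max_power_w),
    )


# Made-up measurements, for illustration only.
profile: Profile = {
    (32, 300): (9.0e5, 3600.0),
    (64, 250): (7.0e5, 3400.0),
    (128, 200): (6.0e5, 4200.0),
}

energy_first = best_config(profile, eta=1.0, max_power_w=300)
time_first = best_config(profile, eta=0.0, max_power_w=300)
print(energy_first, time_first)
```

With these made-up numbers, weighting energy alone favors the largest batch size at the lowest power limit, while weighting time alone favors the middle configuration. Zeus itself finds the configuration online rather than from a fixed table.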
.
├── zeus/ # ⚡ Zeus Python package
│ ├── run/ # - Tools for running Zeus on real training jobs
│ ├── policy/ # - Optimization policies and extension interfaces
│ ├── profile/ # - Tools for profiling energy and time
│ ├── simulate.py # - Tools for trace-driven simulation
│ ├── util/ # - Utility functions and classes
│ ├── analyze.py # - Analysis functions for power logs
│ ├── monitor.py # - Class for profiling energy inside training scripts
│ └── job.py # - Class for job specification
│
├── zeus_monitor/ # 🔌 GPU power monitor
│ ├── zemo/ # - A header-only library for querying NVML
│ └── main.cpp # - Source code of the power monitor
│
├── examples/ # 🛠️ Examples of integrating Zeus
│ ├── capriccio/ # - Integrating with Huggingface and Capriccio
│ ├── cifar100/ # - Integrating with torchvision and CIFAR100
│ └── trace_driven/ # - Using the Zeus trace-driven simulator
│
├── capriccio/ # 🌊 A drifting sentiment analysis dataset
│
└── trace/ # 🗃️ Train and power traces for various GPUs and DNNs
Refer to Getting started for complete instructions on environment setup, installation, and integration.
We provide a Docker image fully equipped with all dependencies and environments. The only command you need is:
docker run -it \
--gpus 1 `# Mount one GPU` \
--cap-add SYS_ADMIN `# Needed to change the power limit of the GPU` \
--shm-size 64G `# PyTorch DataLoader workers need enough shm` \
symbioticlab/zeus:latest \
bash
Refer to Environment setup for details.
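The `SYS_ADMIN` capability matters because changing a GPU's power limit is a privileged NVML operation. As a rough sketch of what that involves, the snippet below enumerates candidate power limits between a GPU's minimum and maximum. It assumes the `pynvml` bindings are installed; the helper names and the 25 W step are assumptions for illustration, not Zeus's implementation.

```python
from typing import List


def candidate_power_limits(min_mw: int, max_mw: int, step_mw: int = 25_000) -> List[int]:
    """List power limits from max down to min, in milliwatts (NVML's unit)."""
    if step_mw <= 0 or min_mw > max_mw:
        raise ValueError("need step_mw > 0 and min_mw <= max_mw")
    return list(range(max_mw, min_mw - 1, -step_mw))


def query_gpu_limits(index: int = 0) -> List[int]:
    """Read the supported power limit range of one GPU and enumerate candidates.

    Requires an NVIDIA GPU and the pynvml package; imported lazily so the
    pure helper above can be used without either.
    """
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        return candidate_power_limits(min_mw, max_mw)
    finally:
        pynvml.nvmlShutdown()


# Example with a hypothetical 100 W to 250 W range.
print(candidate_power_limits(100_000, 250_000, step_mw=50_000))
```

Reading the range works without extra privileges. Applying one of these values with `pynvml.nvmlDeviceSetPowerManagementLimit` is the step that needs the capability granted in the `docker run` command above.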
We provide working examples for integrating and running Zeus:
- Integrating Zeus with Computer Vision
- Integrating Zeus with Natural Language Processing
- Running trace-driven simulation on single recurring jobs and the Alibaba GPU cluster trace
You can easily implement custom policies for batch size and power limit optimization and plug them into Zeus.
Refer to Extending Zeus for details.
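To give a feel for what a custom batch size policy might look like, here is a self-contained sketch for a recurring job. The class, its `predict`/`observe` methods, and the costs are hypothetical and do not reproduce Zeus's extension interface (see Extending Zeus for that). The strategy is a plain explore-then-exploit rule, which is simpler than the policy Zeus ships.

```python
from typing import Dict, List


class ExploreThenExploitPolicy:
    """Hypothetical policy: try each batch size once, then reuse the cheapest."""

    def __init__(self, batch_sizes: List[int]) -> None:
        if not batch_sizes:
            raise ValueError("at least one batch size is required")
        self.batch_sizes = list(batch_sizes)
        self.costs: Dict[int, List[float]] = {bs: [] for bs in self.batch_sizes}

    def predict(self) -> int:
        """Pick the batch size for the next recurrence of the job."""
        if not self.batch_sizes:
            raise RuntimeError("every batch size failed to converge")
        for bs in self.batch_sizes:
            if not self.costs[bs]:  # not explored yet
                return bs
        # All explored: exploit the lowest average observed cost.
        return min(self.batch_sizes, key=lambda bs: sum(self.costs[bs]) / len(self.costs[bs]))

    def observe(self, batch_size: int, cost: float, converged: bool) -> None:
        """Record the outcome of one run; prune batch sizes that do not converge."""
        if batch_size not in self.costs:
            return
        if not converged:
            self.batch_sizes.remove(batch_size)
            del self.costs[batch_size]
            return
        self.costs[batch_size].append(cost)


# Simulated recurring job with made-up costs; batch size 128 never converges.
true_cost = {32: 10.0, 64: 7.0}
policy = ExploreThenExploitPolicy([32, 64, 128])
history = []
for _ in range(5):
    bs = policy.predict()
    history.append(bs)
    policy.observe(bs, true_cost.get(bs, float("inf")), converged=bs in true_cost)
print(history)
```

In this simulation the policy tries each batch size once, drops the one that fails to reach the target metric, and then settles on the cheaper of the remaining two.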
Please consider citing our NSDI’23 paper if you find Zeus relevant to your research.
@inproceedings{zeus-nsdi23,
title = {Zeus: Understanding and Optimizing {GPU} Energy Consumption of {DNN} Training},
author = {Jie You and Jae-Won Chung and Mosharaf Chowdhury},
booktitle = {USENIX NSDI},
year = {2023}
}
Jae-Won Chung (jwnchung@umich.edu)