Mobile and IoT applications increasingly adopt deep learning inference to provide intelligence. Inference requests are typically sent to a cloud infrastructure over a wireless network that is highly variable, leading to the challenge of dynamic Service Level Objectives (SLOs) at the request level. This paper presents Sponge, a novel deep learning inference serving system that maximizes resource efficiency while guaranteeing dynamic SLOs. Sponge achieves its goal by applying in-place vertical scaling, dynamic batching, and request reordering. Specifically, we introduce an Integer Programming formulation to capture the resource allocation problem, providing a mathematical model of the relationship between latency, batch size, and resources. We demonstrate the potential of Sponge through a prototype implementation and preliminary experiments, and discuss future work.
- Go to the infrastructure guide to set up the K8s cluster and related dependencies; the complete installation takes ~30 minutes.
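  The infrastructure guide is the authoritative reference here; as a minimal sketch, assuming a local minikube-based cluster, the bring-up could look like:

  ```bash
  # Minimal local bring-up (an assumption; the infrastructure guide covers the
  # full dependency installation, which takes ~30 minutes)
  minikube start
  kubectl get nodes   # verify the cluster is reachable
  ```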
- Get the minikube IP using `minikube ip` and add it to your configs.
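  For example, the IP can be captured into a shell variable before editing the configs; which config field should hold it depends on your config files and is not assumed here:

  ```bash
  # Capture the cluster IP reported by minikube
  MINIKUBE_IP=$(minikube ip)
  echo "Add this IP to the experiment configs: ${MINIKUBE_IP}"
  ```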
- Dyaninf uses config YAML files for running experiments; the config files used in the paper are stored in the `data/configs/final` folder. To run an experiment, change into the runner directory and launch the runner with the appropriate config file:

  ```bash
  cd experiments/runner
  python runner_script.py --config-name <config-name>
  ```
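  As a concrete example with a hypothetical config name, assuming a file `example.yaml` exists under `data/configs/final`, a full run would look like this; whether the name is passed with or without the `.yaml` extension depends on how `runner_script.py` resolves it:

  ```bash
  # Hypothetical end-to-end run; "example" stands in for one of the
  # config files under data/configs/final
  cd experiments/runner
  python runner_script.py --config-name example
  ```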
Please use the following citation if you use this framework:
@inproceedings{razavi2024sponge,
title={Sponge: Inference Serving with Dynamic SLOs Using In-Place Vertical Scaling},
author={Razavi, Kamran and Ghafouri, Saeid and M{\"u}hlh{\"a}user, Max and Jamshidi, Pooyan and Wang, Lin},
booktitle={Proceedings of the 4th Workshop on Machine Learning and Systems},
pages={184--191},
year={2024}
}