We present a framework for achieving end-to-end management objectives for multiple services that concurrently run on a service mesh. Using reinforcement learning (RL), the framework trains an agent that periodically executes control actions to reallocate resources. We develop and evaluate the framework on a laboratory testbed where information and computing services run on a service mesh, supported by Istio and Kubernetes.
We investigate several management objectives, including enforcing end-to-end delay bounds on service requests, optimizing throughput, meeting cost-related objectives, and providing service differentiation. Notably, we compute control policies on a simulator rather than on the testbed, which significantly speeds up training for the scenarios under study.
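The simulator relies on a system model learned from testbed traces. The sketch below is hypothetical (the feature names, data, and delay model are invented for illustration): it fits a random forest regressor on trace-style data mapping resource allocation and load to end-to-end delay, then exports it with joblib for reuse inside a simulator, matching the roles the dependency list assigns to `scikit-learn` and `joblib`.

```python
# Hypothetical sketch: learn a system model (end-to-end delay as a function
# of allocated CPU and offered load) from testbed-style traces, then export
# it with joblib for the simulator. Data and feature choices are invented.
import numpy as np
import joblib
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic "trace": columns are [cpu_allocation, offered_load]
X = rng.uniform([0.1, 1.0], [4.0, 100.0], size=(500, 2))
# Invented ground truth: delay grows with load and shrinks with CPU share
y = 5.0 * X[:, 1] / X[:, 0] + rng.normal(0.0, 1.0, size=500)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

joblib.dump(model, "system_model.joblib")      # export for the simulator
restored = joblib.load("system_model.joblib")  # reload inside the simulator
pred = restored.predict([[2.0, 50.0]])         # predicted delay for one state
```

In a setup like this, the regressor replaces the testbed in the training loop: the RL agent queries the learned model for the delay a candidate allocation would produce, instead of executing the action on the real system.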
The framework takes a top-down approach: management objectives are defined first and then mapped onto the available control actions. This enables several control actions to execute concurrently, and allows the agent to be trained for different management objectives in parallel, after first learning the system model and operating region from testbed traces.
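In the framework itself the RL environments are built with gym/gymnasium and the agents come from stable-baselines3 and sb3-contrib. The dependency-free sketch below only illustrates the shape of such an environment, not the framework's actual model: the state holds per-service CPU allocations, an action reallocates one CPU unit between services, and the reward penalizes violations of an end-to-end delay bound. All class names, numbers, and the toy delay model are illustrative assumptions.

```python
import random

class MeshEnvSketch:
    """Illustrative stand-in for a gymnasium.Env: two services share a fixed
    CPU budget; the agent shifts CPU between them to keep delays bounded."""

    DELAY_BOUND = 30.0  # invented end-to-end delay bound (ms)
    BUDGET = 8          # invented total CPU units

    def reset(self, seed=None):
        random.seed(seed)
        # Start from an even split of the CPU budget
        self.cpu = [self.BUDGET // 2, self.BUDGET - self.BUDGET // 2]
        return self._obs(), {}

    def step(self, action):
        # action 0: move one CPU unit to service 0; action 1: to service 1
        src, dst = (1, 0) if action == 0 else (0, 1)
        if self.cpu[src] > 1:
            self.cpu[src] -= 1
            self.cpu[dst] += 1
        delays = self._obs()
        # Reward: negative count of services violating the delay bound
        reward = -sum(d > self.DELAY_BOUND for d in delays)
        return delays, reward, False, False, {}

    def _obs(self):
        # Toy delay model: delay falls as a service gets more CPU, plus noise
        return [100.0 / c + random.uniform(0.0, 2.0) for c in self.cpu]

env = MeshEnvSketch()
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(0)
```

With a real `gymnasium.Env` of this shape, training reduces to passing the environment to an agent such as PPO (or Maskable PPO when some reallocations must be forbidden) and calling its `learn` method.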
- `gym` and `gymnasium`: for creating the RL environments
- `joblib`: for loading/exporting random forest regressor models
- `sb3-contrib`: for reinforcement learning agents (Maskable PPO)
- `scikit-learn`: for random forest regression
- `scipy`: for random forest regression
- `stable-baselines3`: for reinforcement learning agents (PPO)
- `torch` and `torchvision`: for neural network training
- `matplotlib`: for plotting
- `pandas`: for data wrangling
- `requests`: for making HTTP requests
- Python 3.7+
- `flake8` (for linting)
- `tox` (for automated testing)
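Assuming the packages above are published on PyPI under the same names, they can be installed in one step:

```shell
python -m pip install gym gymnasium joblib sb3-contrib scikit-learn scipy \
    stable-baselines3 torch torchvision matplotlib pandas requests flake8 tox
```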
Creative Commons (C) 2021-2024, Forough Shahabsamani
- Forough Shahabsamani <foro@kth.se>