We present a framework for achieving end-to-end management objectives for multiple services that concurrently run on a service mesh. Using reinforcement learning (RL), the framework trains an agent that periodically executes control actions to reallocate resources. We develop and evaluate the framework on a laboratory testbed where information and computing services run on a service mesh, supported by Istio and Kubernetes.
We investigate several management objectives, including enforcing end-to-end delay bounds on service requests, optimizing throughput, meeting cost-related objectives, and providing service differentiation. Notably, we compute control policies on a simulator rather than on the testbed, which significantly speeds up training for the scenarios under study.
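The simulator relies on a system model learned from testbed traces. The sketch below is hypothetical (the feature names, data, and delay model are invented for illustration): it fits a random forest regressor on trace-style data mapping resource allocation and load to end-to-end delay, then exports it with joblib for reuse inside a simulator, matching the roles the dependency list assigns to `scikit-learn` and `joblib`.

```python
# Hypothetical sketch: learn a system model (end-to-end delay as a function
# of allocated CPU and offered load) from testbed-style traces, then export
# it with joblib for the simulator. Data and feature choices are invented.
import numpy as np
import joblib
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic "trace": columns are [cpu_allocation, offered_load]
X = rng.uniform([0.1, 1.0], [4.0, 100.0], size=(500, 2))
# Invented ground truth: delay grows with load and shrinks with CPU share
y = 5.0 * X[:, 1] / X[:, 0] + rng.normal(0.0, 1.0, size=500)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

joblib.dump(model, "system_model.joblib")      # export for the simulator
restored = joblib.load("system_model.joblib")  # reload inside the simulator
pred = restored.predict([[2.0, 50.0]])         # predicted delay for one state
```

In a setup like this, the regressor replaces the testbed in the training loop: the RL agent queries the learned model for the delay a candidate allocation would produce, instead of executing the action on the real system.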
The framework takes a top-down approach: management objectives are defined first and then mapped onto the available control actions. This enables several control actions to execute concurrently, and allows the agent to be trained for different management objectives in parallel, after first learning the system model and operating region from testbed traces.
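In the framework itself the RL environments are built with gym/gymnasium and the agents come from stable-baselines3 and sb3-contrib. The dependency-free sketch below only illustrates the shape of such an environment, not the framework's actual model: the state holds per-service CPU allocations, an action reallocates one CPU unit between services, and the reward penalizes violations of an end-to-end delay bound. All class names, numbers, and the toy delay model are illustrative assumptions.

```python
import random

class MeshEnvSketch:
    """Illustrative stand-in for a gymnasium.Env: two services share a fixed
    CPU budget; the agent shifts CPU between them to keep delays bounded."""

    DELAY_BOUND = 30.0  # invented end-to-end delay bound (ms)
    BUDGET = 8          # invented total CPU units

    def reset(self, seed=None):
        random.seed(seed)
        # Start from an even split of the CPU budget
        self.cpu = [self.BUDGET // 2, self.BUDGET - self.BUDGET // 2]
        return self._obs(), {}

    def step(self, action):
        # action 0: move one CPU unit to service 0; action 1: to service 1
        src, dst = (1, 0) if action == 0 else (0, 1)
        if self.cpu[src] > 1:
            self.cpu[src] -= 1
            self.cpu[dst] += 1
        delays = self._obs()
        # Reward: negative count of services violating the delay bound
        reward = -sum(d > self.DELAY_BOUND for d in delays)
        return delays, reward, False, False, {}

    def _obs(self):
        # Toy delay model: delay falls as a service gets more CPU, plus noise
        return [100.0 / c + random.uniform(0.0, 2.0) for c in self.cpu]

env = MeshEnvSketch()
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(0)
```

With a real `gymnasium.Env` of this shape, training reduces to passing the environment to an agent such as PPO (or Maskable PPO when some reallocations must be forbidden) and calling its `learn` method.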
- `gym` and `gymnasium`: for creating the RL environments
- `joblib`: for loading/exporting random forest regressor models
- `sb3-contrib`: for reinforcement learning agents (Maskable PPO)
- `scikit-learn`: for random forest regression
- `scipy`: for random forest regression
- `stable-baselines3`: for reinforcement learning agents (PPO)
- `torch` and `torchvision`: for neural network training
- `matplotlib`: for plotting
- `pandas`: for data wrangling
- `requests`: for making HTTP requests
- Python 3.7+
- `flake8` (for linting)
- `tox` (for automated testing)
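Assuming the packages above are published on PyPI under the same names, they can be installed in one step:

```shell
python -m pip install gym gymnasium joblib sb3-contrib scikit-learn scipy \
    stable-baselines3 torch torchvision matplotlib pandas requests flake8 tox
```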
Creative Commons (C) 2021-2024, Forough Shahabsamani
- Forough Shahabsamani <foro@kth.se>