This is the official repository for the paper "CausIL: Causal Graph for Instance Level Microservice Data", accepted in the Proceedings of The Web Conference 2023 (WWW '23), Austin, Texas, USA.
Please cite our paper in any published work that uses any of these resources.
@inproceedings{10.1145/3543507.3583274,
author = {Chakraborty, Sarthak and Garg, Shaddy and Agarwal, Shubham and Chauhan, Ayush and Saini, Shiv Kumar},
title = {CausIL: Causal Graph for Instance Level Microservice Data},
year = {2023},
isbn = {9781450394161},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3543507.3583274},
doi = {10.1145/3543507.3583274},
abstract = {AI-based monitoring has become crucial for cloud-based services due to its scale. A common approach to AI-based monitoring is to detect causal relationships among service components and build a causal graph. Availability of domain information makes cloud systems even better suited for such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load-balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques as they neither incorporate the system architectural domain information nor provide a way to model distributed compute across varying numbers of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from system architecture. Towards the application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. Simulation study shows the efficacy of CausIL over baselines by improving graph estimation accuracy by ∼ 25% as measured by Structural Hamming Distance whereas the real-world dataset demonstrates CausIL’s applicability in deployment settings.},
booktitle = {Proceedings of the ACM Web Conference 2023},
pages = {2905–2915},
numpages = {11},
keywords = {System Monitoring, Causal Graph, Microservices, Causal Structure Detection},
location = {Austin, TX, USA},
series = {WWW '23}
}
AI-based monitoring has become crucial for cloud-based services due to its scale. A common approach to AI-based monitoring is to detect causal relationships among service components and build a causal graph. Availability of domain information makes cloud systems even better suited for such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load-balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques as they neither incorporate the system architectural domain information nor provide a way to model distributed compute across varying numbers of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from system architecture. Towards the application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. Simulation study shows the efficacy of CausIL over baselines by improving graph estimation accuracy by ~25% as measured by Structural Hamming Distance whereas the real-world dataset demonstrates CausIL's applicability in deployment settings.
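The abstract reports accuracy in terms of Structural Hamming Distance (SHD). As a rough illustration of that metric, the sketch below counts the edge additions, deletions, and reversals needed to turn an estimated directed graph into the ground truth. The function name and the edge-list representation are assumptions made for this example, and counting a reversed edge as a single error is one common convention; the evaluation code in this repository may differ.

```python
def structural_hamming_distance(true_edges, est_edges):
    """Count node pairs whose edge status differs between two directed graphs.

    Each graph is an iterable of (source, target) tuples. A missing edge,
    an extra edge, and a reversed edge each count as one error (assumption:
    some definitions count a reversal as two).
    """
    true_set = set(true_edges)
    est_set = set(est_edges)
    distance = 0
    seen_pairs = set()
    for u, v in true_set | est_set:
        pair = frozenset((u, v))
        if pair in seen_pairs:
            continue  # this unordered pair was already compared
        seen_pairs.add(pair)
        # Collect the directions present between u and v in each graph.
        true_dirs = {e for e in ((u, v), (v, u)) if e in true_set}
        est_dirs = {e for e in ((u, v), (v, u)) if e in est_set}
        if true_dirs != est_dirs:
            distance += 1
    return distance


# Toy graphs over metric names; these edges are illustrative only.
truth = [("W", "C"), ("C", "L"), ("W", "R")]
estimate = [("W", "C"), ("L", "C"), ("R", "U")]
shd = structural_hamming_distance(truth, estimate)
```

In this toy case one edge is reversed, one is missing, and one is extra, so lower values indicate an estimate closer to the ground truth.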
- Generate_ServiceGraph.py: Generates 10 random directed acyclic call graphs between multiple services and stores them in the Data/{N}_services directory.
  usage: Generate_ServiceGraph.py [-h] -N NODES -E EDGES
  Generate Ground Truth Service Graph
  optional arguments:
    -h, --help            show this help message and exit
    -N NODES, --nodes NODES
                          Number of nodes
    -E EDGES, --edges EDGES
                          Number of edges
- GenerateSyntheticData.py: Generates synthetic data given the path to the service call graph. Since our goal was to generate data that closely resembles real data, we trained 5 quadratic regression models: 4 of them are for the interior and leaf nodes, based on the causal metric graph (Section 4.1), while the remaining one (f1) estimates the number of instances spawned as the workload varies. We are not sharing the learned models because the dataset was proprietary. However, one can learn the same models on any dataset and save them in the quadratic_models directory. EXOG_PATH is the path to the file that contains the real workload distribution. For synthetic data, we sample random workload from a normal distribution with mean and variance equal to those of the real exogenous workload.
  usage: GenerateSyntheticData.py [-h] -N NODES [-L LAG] [--path_exog PATH_EXOG]
  Generate Synthetic Data
  optional arguments:
    -h, --help            show this help message and exit
    -N NODES, --nodes NODES
                          Number of services in service graph
    -L LAG, --lag LAG     Lag for workload to affect number of resources [default: 1]
    --path_exog PATH_EXOG
                          Path to exogneous workload
- GenerateSemiSyntheticData.py: Generates semi-synthetic data through a process similar to the one above. However, the learned models f1, ..., f5 are random forest models and hence can estimate values closer to those of the real system. These models need to be saved in the rf_models directory. The path to the real workload, which is used as exogenous data, needs to be specified in EXOG_PATH.
  usage: GenerateSemiSyntheticData.py [-h] -N NODES [-L LAG] [--path_exog PATH_EXOG]
  Generate Synthetic Data
  optional arguments:
    -h, --help            show this help message and exit
    -N NODES, --nodes NODES
                          Number of services in service graph
    -L LAG, --lag LAG     Lag for workload to affect number of resources [default: 1]
    --path_exog PATH_EXOG
                          Path to exogneous workload
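Generate_ServiceGraph.py produces random directed acyclic call graphs from a node count and an edge count. One simple way to obtain such a graph is sketched below: fix a random ordering of the services and only allow edges that point forward in that ordering, which rules out cycles by construction. The function name `random_call_graph` and the `seed` parameter are hypothetical, and the actual script may apply further constraints (for example, connectivity) that this sketch does not.

```python
import random
from itertools import combinations


def random_call_graph(num_nodes, num_edges, seed=None):
    """Return a sorted list of (caller, callee) edges forming a random DAG."""
    rng = random.Random(seed)
    max_edges = num_nodes * (num_nodes - 1) // 2
    if not 0 <= num_edges <= max_edges:
        raise ValueError(f"num_edges must be between 0 and {max_edges}")
    # A shuffled ordering acts as a topological order for the graph.
    order = list(range(num_nodes))
    rng.shuffle(order)
    # combinations() pairs each service only with services later in the
    # ordering, so every candidate edge points "forward" and no cycle can form.
    candidates = list(combinations(order, 2))
    return sorted(rng.sample(candidates, num_edges))


edges = random_call_graph(num_nodes=10, num_edges=15, seed=0)
```

Here `num_nodes` and `num_edges` play the role of the -N and -E arguments described above.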
Data will be stored in Data/<N>_services/<synthetic/semi_synthetic>/Graph<graph_number>/Data.pkl.
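For the synthetic setting, the workload is sampled from a normal distribution whose mean and variance match those of the real exogenous workload. A minimal sketch of that sampling step follows; the function name `sample_workload`, the toy input values, and the use of the population standard deviation are assumptions made for illustration and are not taken from GenerateSyntheticData.py.

```python
import random
import statistics


def sample_workload(real_workload, num_samples, seed=None):
    """Draw synthetic workload values from N(mean, variance) of the real series."""
    rng = random.Random(seed)
    mean = statistics.fmean(real_workload)
    # Population standard deviation of the observed workload (assumption:
    # the original script may use the sample standard deviation instead).
    std = statistics.pstdev(real_workload)
    return [rng.gauss(mean, std) for _ in range(num_samples)]


# Toy stand-in for the series that would be read from EXOG_PATH.
real = [90, 110, 100, 130, 70, 100]
samples = sample_workload(real, num_samples=5000, seed=1)
```

Note that a normal distribution can produce negative values; whether and how the original script handles that case is not stated here.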
To generate the prohibited edge list, which is supplied as domain knowledge, run CreateProhibEdges.py. Given the number of services N as input, it reads the corresponding call graph from the appropriate directory, generates the list of prohibited edges, and stores it in the same directories as the data.
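The exact rules that CreateProhibEdges.py applies are not described in this README. Purely as an illustration of how a call graph can be turned into a prohibited edge list, the sketch below forbids every causal edge between metrics of two services that do not call each other directly. This rule, the function name `prohibited_edges`, and the `<metric>_<service>` node naming are all assumptions for the example and should not be read as the script's actual behavior.

```python
from itertools import product

# Metric symbols as listed in the data description (W, R, C, U, E, L).
METRICS = ["W", "R", "C", "U", "E", "L"]


def prohibited_edges(services, call_edges):
    """List metric-level edges between services with no direct call relation."""
    # Treat a call in either direction as "connected" (assumed rule).
    connected = set()
    for caller, callee in call_edges:
        connected.add((caller, callee))
        connected.add((callee, caller))

    prohibited = []
    for src, dst in product(services, repeat=2):
        if src == dst or (src, dst) in connected:
            continue  # same service or directly connected: nothing is forbidden
        for m_src, m_dst in product(METRICS, repeat=2):
            prohibited.append((f"{m_src}_{src}", f"{m_dst}_{dst}"))
    return prohibited


# Toy chain 0 -> 1 -> 2: services 0 and 2 never call each other directly.
forbidden = prohibited_edges(services=[0, 1, 2], call_edges=[(0, 1), (1, 2)])
```

A list of this kind can then be handed to a structure learning algorithm so that it never proposes the forbidden edges.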
We are sharing the synthetic data generated using the above steps. The data can be downloaded from this link.
The file contains the synthetically generated datasets used in the evaluation of the paper "CausIL: Causal Graph for Instance Level Microservice Data". The service call graphs used to generate the data have 10, 20, and 40 services, and each size has 10 distinct synthetic datasets corresponding to distinct call graph patterns. The data is stored as a dictionary of dictionaries, in the format below:
Data = { timestamp : {Metric_Service_agg : <aggregated value>, Metric_Service_inst : <list of values where len(list) = # instances of the service at the timestamp>}}
The suffix 'agg' indicates the aggregated (averaged) metric value over all instances, while 'inst' indicates a list of metric values where each element in the list denotes the metric value of the corresponding instance/pod.
Metrics:
- R = # instances
- W = workload
- C = cpu utilization
- U = memory utilization
- E = error
- L = latency
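To make the nested dictionary format concrete, the sketch below builds a tiny dataset in the same shape, writes it to a temporary Data.pkl with pickle, reads it back, and derives two quantities: the number of instances per timestamp and the mean over instances, which should agree with the 'agg' value. The keys `C_A_agg` and `C_A_inst` (CPU utilization of a service named A) and the helper function names are hypothetical; the real key names in the shared data may be formatted differently.

```python
import pickle
import tempfile
from pathlib import Path

# Toy data in the described shape: timestamp -> {metric key -> value or list}.
data = {
    0: {"C_A_agg": 0.5, "C_A_inst": [0.4, 0.6]},
    1: {"C_A_agg": 0.6, "C_A_inst": [0.5, 0.6, 0.7]},
}


def instance_counts(dataset, inst_key):
    """Number of instances of the service at each timestamp."""
    return {ts: len(record[inst_key]) for ts, record in dataset.items()}


def mean_over_instances(dataset, inst_key):
    """Average of the per-instance values at each timestamp."""
    return {
        ts: sum(record[inst_key]) / len(record[inst_key])
        for ts, record in dataset.items()
    }


# Round trip through a pickle file, mirroring how Data.pkl would be read.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "Data.pkl"
    with open(path, "wb") as fh:
        pickle.dump(data, fh)
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)

counts = instance_counts(loaded, "C_A_inst")
means = mean_over_instances(loaded, "C_A_inst")
```

Because the number of instances changes over time, the 'inst' lists have different lengths at different timestamps, which is why the data is kept as nested dictionaries rather than a fixed-width table.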
In addition to the data, we also provide the service call graph from which it was generated (Graph<# services>.gpickle), the ground truth DAG causal graph (DAG.gpickle) and the prohibited edge list computed based on the domain knowledge for each such service call graph instance.
We recommend Python 3.8 to run this codebase.
- Create a virtual environment and activate it
pip install virtualenv
virtualenv env
source env/bin/activate
- Install the required libraries by running
pip install -r requirements.txt
- Install the library required to run fGES
cd LIB/
pip install -e .
cd ../
- Run CausIL.py
python CausIL.py
usage: CausIL.py [-h] -D DATASET -S NUM_SERVICES [-G GRAPH_NUMBER] [--dk DK] [--score_func SCORE_FUNC]
Run CausIL
optional arguments:
-h, --help show this help message and exit
-D DATASET, --dataset DATASET
Dataset type (synthetic/semi_synthetic)
-S NUM_SERVICES, --num_services NUM_SERVICES
Numer of Services in the dataset (10, 20, etc.)
-G GRAPH_NUMBER, --graph_number GRAPH_NUMBER
Graph Instance in the particular dataset [default: 0]
--dk DK To use domain knowledge or not (Y/N) [default: Y]
--score_func SCORE_FUNC
Which score function to use (L: linear, P2: polynomial of degree 2, P3: polynomial of degree 3)
[default: P2]
- Follow similar steps to run Avg-fGES.py.