
CausIL is an approach to estimate the causal graph for a cloud microservice system, where the nodes are the service-specific metrics while edges indicate causal dependency among the metrics. The approach considers metric variations for all the instances deployed in the system to build the causal graph and can account for auto-scaling decisions.


CausIL: Causal Graph for Instance Level Microservice Data

This is the official repository corresponding to the paper titled "CausIL: Causal Graph for Instance Level Microservice Data", accepted at The Web Conference 2023 (WWW '23), Austin, Texas, USA.

Please cite our paper in any published work that uses any of these resources.

@inproceedings{10.1145/3543507.3583274,
author = {Chakraborty, Sarthak and Garg, Shaddy and Agarwal, Shubham and Chauhan, Ayush and Saini, Shiv Kumar},
title = {CausIL: Causal Graph for Instance Level Microservice Data},
year = {2023},
isbn = {9781450394161},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3543507.3583274},
doi = {10.1145/3543507.3583274},
abstract = {AI-based monitoring has become crucial for cloud-based services due to its scale. A common approach to AI-based monitoring is to detect causal relationships among service components and build a causal graph. Availability of domain information makes cloud systems even better suited for such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load-balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques as they neither incorporate the system architectural domain information nor provide a way to model distributed compute across varying numbers of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from system architecture. Towards the application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. Simulation study shows the efficacy of CausIL over baselines by improving graph estimation accuracy by ∼ 25% as measured by Structural Hamming Distance whereas the real-world dataset demonstrates CausIL’s applicability in deployment settings.},
booktitle = {Proceedings of the ACM Web Conference 2023},
pages = {2905–2915},
numpages = {11},
keywords = {System Monitoring, Causal Graph, Microservices, Causal Structure Detection},
location = {Austin, TX, USA},
series = {WWW '23}
}

Abstract

AI-based monitoring has become crucial for cloud-based services due to its scale. A common approach to AI-based monitoring is to detect causal relationships among service components and build a causal graph. Availability of domain information makes cloud systems even better suited for such causal detection approaches. In modern cloud systems, however, auto-scalers dynamically change the number of microservice instances, and a load-balancer manages the load on each instance. This poses a challenge for off-the-shelf causal structure detection techniques as they neither incorporate the system architectural domain information nor provide a way to model distributed compute across varying numbers of service instances. To address this, we develop CausIL, which detects a causal structure among service metrics by considering compute distributed across dynamic instances and incorporating domain knowledge derived from system architecture. Towards the application in cloud systems, CausIL estimates a causal graph using instance-specific variations in performance metrics, modeling multiple instances of a service as independent, conditional on system assumptions. Simulation study shows the efficacy of CausIL over baselines by improving graph estimation accuracy by ~25% as measured by Structural Hamming Distance whereas the real-world dataset demonstrates CausIL's applicability in deployment settings.

Data Generation

  • Generate_ServiceGraph.py: Generates 10 random directed acyclic call graphs between multiple services and stores them in the Data/{N}_services directory.

      usage: Generate_ServiceGraph.py [-h] -N NODES -E EDGES
    
      Generate Ground Truth Service Graph
    
      optional arguments:
        -h, --help            show this help message and exit
        -N NODES, --nodes NODES
                              Number of nodes
        -E EDGES, --edges EDGES
                              Number of edges
    
  • GenerateSyntheticData.py: Generates synthetic data given the service call graph path. Since our goal was to generate data that closely resembles real data, we trained 5 quadratic regression models: 4 of them model the interior and leaf nodes of the causal metric graph (Section 4.1), while the remaining one (f1) estimates the number of instances spawned as the workload varies. We are not sharing the learned models because the dataset they were trained on is proprietary; however, one can learn equivalent models on any dataset and save them in the quadratic_models directory (a minimal sketch of fitting and saving such a model is shown after this list). EXOG_PATH is the path to the file that contains the real workload distribution. For synthetic data, we sample a random workload from a normal distribution whose mean and variance equal those of the real exogenous workload.

      usage: GenerateSyntheticData.py [-h] -N NODES [-L LAG] [--path_exog PATH_EXOG]
    
      Generate Synthetic Data
    
      optional arguments:
        -h, --help            show this help message and exit
        -N NODES, --nodes NODES
                              Number of services in service graph
        -L LAG, --lag LAG     Lag for workload to affect number of resources [default: 1]
        --path_exog PATH_EXOG
                              Path to exogenous workload
    
  • GenerateSemiSyntheticData.py: Generates semi-synthetic data in a process similar to the above. However, the learned models f1, ..., f5 are random forest models and hence can estimate values closer to those of the real system. These models need to be saved in the rf_models directory. The path to the real workload, which will be used as exogenous data, needs to be specified in EXOG_PATH.

      usage: GenerateSemiSyntheticData.py [-h] -N NODES [-L LAG] [--path_exog PATH_EXOG]
    
      Generate Synthetic Data
    
      optional arguments:
        -h, --help            show this help message and exit
        -N NODES, --nodes NODES
                              Number of services in service graph
        -L LAG, --lag LAG     Lag for workload to affect number of resources [default: 1]
        --path_exog PATH_EXOG
                              Path to exogenous workload
    

Data will be stored in Data/<N>_services/<synthetic/semi_synthetic>/Graph<graph_number>/Data.pkl.
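
Since the learned models are not shared, they must be fitted on your own metric data before running GenerateSyntheticData.py. The snippet below is a minimal, hypothetical sketch of fitting one degree-2 polynomial regression with scikit-learn and saving it into the quadratic_models directory; the file name, features, and targets are placeholders, so adapt them to whatever GenerateSyntheticData.py expects to load. The analogous procedure with RandomForestRegressor models saved to rf_models applies to GenerateSemiSyntheticData.py.

    import os
    import pickle

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    os.makedirs("quadratic_models", exist_ok=True)

    # Placeholder training data -- replace with metrics collected from your own system,
    # e.g. X = workload and y = the quantity the model should predict (f1, for instance,
    # maps workload to the number of spawned instances).
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 100, size=(1000, 1))
    y = 0.02 * X[:, 0] ** 2 + 0.5 * X[:, 0] + rng.normal(size=1000)

    # Degree-2 polynomial (quadratic) regression.
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(X, y)

    # Hypothetical file name -- use whatever name GenerateSyntheticData.py expects.
    with open(os.path.join("quadratic_models", "f1.pkl"), "wb") as f:
        pickle.dump(model, f)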

To generate the prohibited edge list, which is given as domain knowledge, run CreateProhibEdges.py. Given the number of services N as input, it reads the corresponding call graph from the correct directory, generates the list of prohibited edges, and stores it in the respective directories where the data is present.
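
For reference, the .gpickle files written by Generate_ServiceGraph.py appear to be pickled networkx graphs (the usual .gpickle convention), so the call graph that CreateProhibEdges.py and the data generators read can be inspected directly. The sketch below assumes a hypothetical path and file name (Data/10_services/Graph0/Graph10.gpickle); adjust both to match where the graph was actually stored.

    import pickle

    import networkx as nx

    # Hypothetical path -- adjust to match your generated output.
    graph_path = "Data/10_services/Graph0/Graph10.gpickle"

    with open(graph_path, "rb") as f:
        call_graph = pickle.load(f)  # .gpickle files are pickled networkx graphs

    print("Services:", call_graph.number_of_nodes())
    print("Call edges:", call_graph.number_of_edges())
    print("Is a DAG:", nx.is_directed_acyclic_graph(call_graph))
    print("Sample call edges:", list(call_graph.edges())[:5])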

Data

We are sharing the synthetic data generated using the steps above. The data can be downloaded from this link.

Data Description

The file contains the synthetically generated datasets used in the evaluation of the paper "CausIL: Causal Graph for Instance Level Microservice Data". The service call graphs used to generate the data have 10, 20, and 40 services, and each size comes with 10 distinct synthetic data distributions corresponding to distinct call graph patterns. The data is stored in a dictionary-of-dictionaries format, as shown below:

Data = { timestamp : {Metric_Service_agg : <aggregated value>, Metric_Service_inst : <list of values where len(list) = # instances of the service at the timestamp>}}

The suffix 'agg' indicates the aggregated (averaged) metric value over all instances, while 'inst' indicates a list of metric values where each element in the list denotes the metric value of the corresponding instance/pod.

Metrics:

  • R = # instances
  • W = workload
  • C = cpu utilization
  • U = memory utilization
  • E = error
  • L = latency

In addition to the data, we also provide the service call graph from which it was generated (Graph<# services>.gpickle), the ground-truth causal DAG (DAG.gpickle), and the prohibited edge list computed from domain knowledge for each such service call graph instance.
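
A minimal sketch for loading and inspecting the shared data is shown below. The paths and the concrete metric key names are assumptions based on the storage layout and the Metric_Service_agg / Metric_Service_inst pattern described above; inspect your download to confirm the exact names.

    import pickle

    # Adjust both paths to match the layout of your download.
    data_path = "Data/10_services/synthetic/Graph0/Data.pkl"
    dag_path = "Data/10_services/synthetic/Graph0/DAG.gpickle"

    with open(data_path, "rb") as f:
        data = pickle.load(f)        # {timestamp: {metric_key: value or list of values}}

    with open(dag_path, "rb") as f:
        causal_dag = pickle.load(f)  # ground-truth causal DAG (pickled networkx graph)

    first_ts = next(iter(data))
    snapshot = data[first_ts]
    print("Timestamp:", first_ts)
    print("Sample metric keys:", list(snapshot)[:8])

    # Keys ending in '_agg' hold the instance-averaged value; keys ending in '_inst'
    # hold one value per running instance of the service at this timestamp.
    for key, value in snapshot.items():
        if key.endswith("_inst"):
            print(key, "-> number of instances:", len(value))
            break

    print("Causal DAG:", causal_dag.number_of_nodes(), "nodes,",
          causal_dag.number_of_edges(), "edges")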

Implementation Steps

We recommend Python 3.8 to run this codebase.

  1. Create a virtual environment and activate it
pip install virtualenv  
virtualenv env  
source env/bin/activate
  2. Install the required libraries by running
pip install -r requirements.txt
  3. Install the library required to run fGES
cd LIB/
pip install -e .
cd ../
  4. Run CausIL.py
python CausIL.py
usage: CausIL.py [-h] -D DATASET -S NUM_SERVICES [-G GRAPH_NUMBER] [--dk DK] [--score_func SCORE_FUNC]

Run CausIL

optional arguments:
  -h, --help            show this help message and exit
  -D DATASET, --dataset DATASET
                        Dataset type (synthetic/semi_synthetic)
  -S NUM_SERVICES, --num_services NUM_SERVICES
                        Number of Services in the dataset (10, 20, etc.)
  -G GRAPH_NUMBER, --graph_number GRAPH_NUMBER
                        Graph Instance in the particular dataset [default: 0]
  --dk DK               To use domain knowledge or not (Y/N) [default: Y]
  --score_func SCORE_FUNC
                        Which score function to use (L: linear, P2: polynomial of degree 2, P3: polynomial of degree 3)
                        [default: P2]
  5. Follow similar steps to run Avg-fGES.py.
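
The paper reports graph estimation accuracy with Structural Hamming Distance (SHD) against the ground-truth DAG. As a convenience, the sketch below computes SHD under one common convention (a reversed edge counts as a single error), assuming both graphs are available as networkx DiGraphs; how CausIL.py or Avg-fGES.py exposes its estimated graph is not shown here, so the variable estimated is left as a placeholder.

    import pickle

    import networkx as nx

    def shd(true_graph: nx.DiGraph, est_graph: nx.DiGraph) -> int:
        """Structural Hamming Distance: missing, extra, or reversed edges,
        with a reversed edge counted as a single error."""
        diff = set(true_graph.edges()) ^ set(est_graph.edges())
        # A reversed edge shows up twice in `diff`; collapse each unordered pair to one error.
        return len({frozenset(edge) for edge in diff})

    # Adjust the path to wherever DAG.gpickle sits in your setup.
    with open("Data/10_services/synthetic/Graph0/DAG.gpickle", "rb") as f:
        ground_truth = pickle.load(f)

    # estimated = ...  # placeholder: the DiGraph produced by CausIL.py / Avg-fGES.py
    # print("SHD:", shd(ground_truth, estimated))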