TVDiag: A Task-oriented and View-invariant Failure Diagnosis Framework with Multimodal Data

Primary language: Python. License: MIT.

TVDiag is a multimodal failure diagnosis framework designed to locate the root cause and identify the failure type in microservice-based systems. This repository offers the core implementation of TVDiag.

Project Structure

.
├── core
│   ├── loss
│   │   ├── AutomaticWeightedLoss.py
│   │   ├── SupervisedContrastiveLoss.py
│   │   └── UnsupervisedContrastiveLoss.py
│   ├── model
│   │   ├── backbone
│   │   │   ├── FC.py
│   │   │   ├── sage.py
│   │   │   └── cnn1d.py
│   │   ├── Classifier.py
│   │   ├── Voter.py
│   │   ├── Encoder.py
│   │   └── MainModel.py
│   ├── aug.py
│   ├── ita.py
│   ├── multimodal_dataset.py
│   └── TVDiag.py
├── data
│   └── gaia
│       ├── tmp
│       ├── raw
│       └── label.csv
├── helper
│   ├── eval.py
│   ├── io_uitl.py
│   ├── logger.py
│   ├── scaler.py
│   └── time_util.py
├── process
│   ├── events
│   │   ├── fasttext_w2v.py
│   │   ├── cnn1d_w2v.py
│   │   └── lda_w2v.py
│   └── EventProcess.py
├── requirements.txt
├── README.md
├── train.sh
└── main.py

Dataset

We conducted experiments on two datasets:

  • GAIA. The GAIA dataset records metrics, traces, and logs of the MicroSS simulation system in July 2021. MicroSS consists of ten microservices and some middleware such as Redis, MySQL, and Zookeeper. The extracted events of GAIA are available from DiagFusion.
  • AIOps-22. The AIOps-22 dataset is derived from the training data released by the AIOps 2022 Challenge, in which failures at three levels (node, service, and instance) were injected into Online-boutique, a web-based e-commerce platform.

Getting Started

Requirements

  • python=3.8.12
  • pytorch=2.1.1
  • fasttext=0.9.2
  • dgl=2.1.0.cu118 (built for CUDA 11.8, the version used in the author's environment)
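
To confirm that an environment matches the versions listed above, a small check along the following lines may help. This is a minimal sketch, not part of the repository: the helper name `check_versions` is hypothetical, and it assumes the packages are installed under the distribution names `torch`, `fasttext`, and `dgl`.

```python
from importlib import metadata

# Versions taken from the Requirements list above. The distribution names
# are an assumption; adjust them if your packages are registered differently.
EXPECTED = {"torch": "2.1.1", "fasttext": "0.9.2", "dgl": "2.1.0"}


def check_versions(expected=EXPECTED):
    """Return a {package: status} report comparing installed and expected versions."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Prefix match so that local build tags (e.g. a CUDA suffix) still pass.
        if installed.startswith(wanted):
            report[name] = "ok"
        else:
            report[name] = f"found {installed}, expected {wanted}"
    return report
```

Calling `check_versions()` returns one status string per package, which can be printed before starting a training run.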

Run

Run the following command:

sh train.sh

The parameters in main.py are described as follows:

Common args

  • dataset: The dataset that you want to use.
  • reconstruct: This parameter represents whether the events should be regenerated. (Default: False)

Model

  • TO: TO denotes whether the task-oriented learning module should be loaded. (Default: True)
  • CM: CM denotes whether the cross-modal association should be established. (Default: True)
  • dynamic_weight: dynamic_weight denotes whether weights are dynamically assigned for each loss. (Default: True)
  • guide_weight: This parameter adjusts the scale of the contrastive loss. (Default: 0.1)
  • temperature: This parameter sets the temperature $\tau$ of the contrastive loss, which controls how much attention is paid to difficult samples. (Default: 0.3)
  • patience: This parameter sets the patience used for early stopping. (Default: 10)
  • aug_percent: The inactivation probability used for data augmentation. (Default: 0.2)
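
The parameters above could be declared with `argparse` roughly as follows. This is an illustrative sketch under stated assumptions rather than the actual contents of main.py: the flag spellings, the `str2bool` helper, the `build_parser` function, and the `gaia` default for `dataset` are all hypothetical. Only the parameter names and the listed defaults come from this README.

```python
import argparse


def str2bool(value):
    """Parse 'true'/'false'-style strings (hypothetical helper, not from the repo)."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="TVDiag parameters (sketch)")
    # Common args. The 'gaia' default is an assumption based on the data/ folder.
    parser.add_argument("--dataset", type=str, default="gaia")
    parser.add_argument("--reconstruct", type=str2bool, default=False)
    # Model args, with the defaults listed in this README.
    parser.add_argument("--TO", type=str2bool, default=True)
    parser.add_argument("--CM", type=str2bool, default=True)
    parser.add_argument("--dynamic_weight", type=str2bool, default=True)
    parser.add_argument("--guide_weight", type=float, default=0.1)
    parser.add_argument("--temperature", type=float, default=0.3)
    parser.add_argument("--patience", type=int, default=10)
    parser.add_argument("--aug_percent", type=float, default=0.2)
    return parser


# Example: disable the task-oriented module and shorten early stopping.
args = build_parser().parse_args(["--TO", "false", "--patience", "5"])
```

With a parser of this shape, an ablation run only needs to override the relevant flags, while every other parameter keeps its default.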