AIOps - Papers and Tutorials

AwesomeAIOps (Papers, Tutorials, and Datasets)

Tutorials and Surveys

  1. Awesome AIOps [GitHub]
  2. Anomaly Detection: A Survey [Paper]
  3. AIOps based on machine learning (Chinese) [Slides]
  4. A collection of tools and datasets for anomaly detection on time-series data [GitHub]

Papers

  1. GRANO: Interactive Graph-based Root Cause Analysis for Cloud-Native Distributed Data Platform (VLDB 2019) [Paper] 🌟
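  • GRANO provides: a detection layer that processes large volumes of time-series monitoring data to detect anomalies in logical and physical system components; an anomaly graph layer with novel graph models and algorithms that uses system topology data and detection results to identify root causes at the system component level; and an application layer that automatically notifies on-call personnel and offers real-time and on-demand RCA support through an interactive graphical interface.
  • Key points (from Sec 3): use graph modeling and a propagation algorithm to measure the importance of detection events and minimize the effect of false-positive alarms.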
  • Step 1 - Graph Construction, which forms a unified anomaly graph G = (V, E). V contains (1) the set of system components, and (2) the set of alarms and events retrieved. Each edge in E represents the interdependency between components.
  • Step 2 - Alarm Edge Scoring, which evaluates the alarm’s importance to a connected system component.
  • Step 3 - Component Node Scoring, which calculates the aggregated confidence score of each component from the criticality of its alarms and the scores of the edges connected to that component.
  • Step 4 - Score Propagation, which detects the actual root cause.
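The four GRANO steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the component names, alarm values, the `edge_score` rule (fraction of windows in which an alarm fired), and the damped propagation rule are all assumptions made for the example.

```python
from collections import defaultdict

# Step 1: unified anomaly graph G = (V, E). All names and numbers are hypothetical.
depends_on = {"web": ["api"], "api": ["db"], "db": []}  # component -> its dependencies
alarms = [
    # (alarm id, component, criticality, windows fired, windows observed)
    ("a1", "web", 0.4, 5, 10),
    ("a2", "api", 0.6, 5, 10),
    ("a3", "db", 0.9, 10, 10),
]

def edge_score(fired, observed):
    """Step 2 (assumed rule): how persistently the alarm fires on its component."""
    return fired / observed

def node_scores(alarms):
    """Step 3: sum criticality * edge score over the alarms attached to each component."""
    scores = defaultdict(float)
    for _, comp, criticality, fired, observed in alarms:
        scores[comp] += criticality * edge_score(fired, observed)
    return dict(scores)

def propagate(base, depends_on, damping=0.5, rounds=3):
    """Step 4 (assumed rule): each round, a component inherits a damped share
    of the scores of the components that depend on it."""
    current = {c: base.get(c, 0.0) for c in depends_on}
    for _ in range(rounds):
        nxt = {c: base.get(c, 0.0) for c in depends_on}
        for comp, deps in depends_on.items():
            for dep in deps:
                nxt[dep] += damping * current[comp]
        current = nxt
    return current

final = propagate(node_scores(alarms), depends_on)
root_cause = max(final, key=final.get)
print(root_cause, final)
```

In this toy topology the score accumulates on the deepest dependency, so the component with the highest propagated score is reported as the root cause candidate.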
  1. iDice: Problem Identification for Emerging Issues (ICSE 2016) [Paper]
  • Identifies the effective attribute combination for an emerging issue with high quality and performance.
  1. Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning (IMC 2015) [Paper]
  2. HotSpot: Anomaly Localization for Additive KPIs With Multi-Dimensional Attributes (IEEE Access 2018) [Paper]
  3. AutoMAP: Diagnose Your Microservice-based Web Applications Automatically (WWW 2020) [Paper] [GitHub] [Notes]
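  • AutoMAP's model uses multi-dimensional time-series metrics to dynamically generate service correlations and diagnose root causes. For multi-dimensional time series, the work analyzes anomalous correlations between series and infers an anomaly behavior graph that describes the correlations between different services. Based on the behavior graph, it designs a heuristic model built on forward, self, and backward random walk algorithms to identify the root cause of service faults. AutoMAP can be deployed quickly on a variety of microservice-based systems without expert knowledge to get started, and it also supports incorporating expert knowledge to improve diagnostic accuracy.
  • Pipeline: (1) sampling, (2) building the anomaly behavior graph, (3) graph correction, (4) heuristic root cause detection on the graph, (5) performance analysis, (6) updating the weights.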
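A forward, self, and backward random walk over a behavior graph can be sketched as below. This is a hedged illustration in the spirit of the description above, not AutoMAP's actual model: the service names, the correlation scores in `corr`, the backward discount `rho`, and the weighting rules are assumptions chosen for the example.

```python
# Hypothetical behavior graph: an edge A -> B means A's anomaly may be caused by B.
edges = {"frontend": ["cart"], "cart": ["db"], "db": []}
# Hypothetical correlation of each service's metrics with the observed anomaly.
corr = {"frontend": 0.3, "cart": 0.5, "db": 0.9}

def transition_row(node, edges, corr, rho=0.2):
    """Normalized forward, backward, and self transition probabilities for one node."""
    # Forward: follow an edge toward a suspected cause, weighted by its correlation.
    forward = {dst: corr[dst] for dst in edges.get(node, [])}
    # Backward: step back to a caller at a discount, so the walk can escape dead ends.
    backward = {src: rho * corr[src] for src, dsts in edges.items() if node in dsts}
    # Self: stay put when this node correlates better than any forward neighbor.
    self_w = max(0.0, corr[node] - max(forward.values(), default=0.0))
    row = dict(backward)
    row.update(forward)
    row[node] = row.get(node, 0.0) + self_w
    total = sum(row.values())
    if total == 0:
        return {node: 1.0}
    return {k: v / total for k, v in row.items()}

def rank(edges, corr, iterations=200):
    """Approximate the walk's stationary distribution by power iteration."""
    rows = {n: transition_row(n, edges, corr) for n in edges}
    pi = {n: 1.0 / len(edges) for n in edges}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in edges}
        for src, row in rows.items():
            for dst, p in row.items():
                nxt[dst] += pi[src] * p
        pi = nxt
    return pi

pi = rank(edges, corr)
ranking = sorted(pi, key=pi.get, reverse=True)
print(ranking)
```

Power iteration is used here instead of sampled walks so the result is deterministic; services where the walk spends the most time are ranked as the most likely root causes.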
  1. Graph-based root cause analysis for service-oriented and micro service architectures (The Journal of Systems and Software 2020) [Paper]
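  • This class of root cause analysis generally assumes an explicit service call topology (a graph) and aims to capture anomalies along the service chain and locate the anomalous entity.
  • In this work, we propose a root cause analysis framework based on graph representations of these architectures. The graphs can be used to compare any anomalous situation occurring in the system against a library of anomalous graphs, which serves as a knowledge base for users troubleshooting those anomalies.
  • Please refer to Section 4 of the paper.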
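Comparing an observed anomaly graph against a library of known anomaly graphs could look roughly like this. The paper defines its own comparison in Section 4; the Jaccard similarity over edge sets used here is a swapped-in stand-in, and the pattern names and edges are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two edge sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical library of previously diagnosed anomaly graphs (sets of call edges).
library = {
    "db-saturation": {("web", "api"), ("api", "db")},
    "cache-miss-storm": {("web", "api"), ("api", "cache")},
}

def best_match(observed, library):
    """Return (pattern name, similarity) of the closest library entry."""
    name = max(library, key=lambda k: jaccard(observed, library[k]))
    return name, jaccard(observed, library[name])

# Hypothetical anomaly graph extracted from the running system.
observed = {("web", "api"), ("api", "db"), ("db", "disk")}
match, similarity = best_match(observed, library)
print(match, round(similarity, 3))
```

The closest library entry points the user at a past diagnosis; a low best similarity would suggest a new, not yet catalogued anomaly pattern.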
  1. Performance Monitoring and Root Cause Analysis for Cloud-hosted Web Applications (WWW 2017) [Paper]
  2. Survey on Models and Techniques for Root-Cause Analysis [Paper]
  3. Graph-Based Trace Analysis for Microservice Architecture Understanding and Problem Diagnosis (FSE 2020) [Paper]
  • The same author as GRANO.
  • GMTA, a graph-based approach of microservice trace analysis, for understanding architecture and diagnosing various problems.
  • GMTA abstracts traces into different paths and further groups them into business flows. To support various analytical applications, GMTA includes an efficient storage and access mechanism by combining a graph database and a real-time analytics database and using a carefully designed storage structure.
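The trace-to-path-to-business-flow abstraction can be illustrated with a small sketch. The trace data, the `path_key` canonical form, and the rule that assigns a path to a business flow by the first service called are assumptions for this example; GMTA's storage layer (graph database plus real-time analytics database) is not modeled.

```python
from collections import defaultdict

# Hypothetical traces: trace id -> ordered list of (caller, callee) spans.
traces = {
    "t1": [("gateway", "checkout"), ("checkout", "payment")],
    "t2": [("gateway", "checkout"), ("checkout", "payment")],
    "t3": [("gateway", "search")],
    "t4": [("gateway", "checkout"), ("checkout", "inventory")],
}

def path_key(spans):
    """Canonical form of a trace: its distinct call edges, order-independent."""
    return tuple(sorted(set(spans)))

def group_paths(traces):
    """Abstract traces into paths: traces with the same call edges share a path."""
    paths = defaultdict(list)
    for trace_id, spans in traces.items():
        paths[path_key(spans)].append(trace_id)
    return dict(paths)

def group_flows(traces):
    """Group paths into business flows (assumed rule: by the first service called)."""
    flows = defaultdict(set)
    for spans in traces.values():
        flows[spans[0][1]].add(path_key(spans))
    return dict(flows)

paths = group_paths(traces)
flows = group_flows(traces)
print(len(paths), {name: len(p) for name, p in flows.items()})
```

Grouping first by path keeps per-trace storage small, since many traces collapse onto one path, and flows then give a business-level view on top of those paths.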
  1. Reliability Analytics for Cloud Based Distributed Databases (SIGMOD 2020, industry track) [Paper] 🌟
  • RADD, an innovative analytic pipeline used to measure reliability and availability for cloud-based distributed databases by leveraging the vast amount of telemetry present in the cloud.
  • RADD implements an event correlation framework that puts the emphasis on data compliance and uses information entropy to measure causality and reduce noisy signals.
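One way to read "information entropy to reduce noisy signals" is to score each telemetry signal by how much it reduces uncertainty about an outage. The sketch below uses information gain for that purpose; it is an assumed formulation rather than RADD's documented one, and the signal names, sample values, and the 0.1 threshold are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(signal, outcome):
    """Reduction in entropy of `outcome` once `signal` is known."""
    n = len(outcome)
    conditional = 0.0
    for value in set(signal):
        subset = [o for s, o in zip(signal, outcome) if s == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(outcome) - conditional

# Hypothetical per-window observations: 1 = event present, 0 = absent.
outage = [1, 1, 0, 0, 1, 0, 0, 0]
signals = {
    "disk_alert": [1, 1, 0, 0, 1, 0, 0, 0],       # tracks the outage exactly
    "heartbeat_noise": [1, 0, 1, 0, 1, 0, 1, 0],  # fires regardless of outages
}

gains = {name: information_gain(sig, outage) for name, sig in signals.items()}
kept = [name for name, g in gains.items() if g >= 0.1]  # assumed noise threshold
print(gains, kept)
```

Information gain measures statistical association only; treating a high-gain signal as causal still needs the temporal ordering and domain checks that an event correlation framework would add.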
  1. Root Cause Analysis of Anomalies of Multitier Services in Public Clouds (IWQOS 2018) [Paper]
  • Particularly, we elaborate a non-intrusive method to capture the dependency relationships of components, which improves the feasibility.
  • During localization, we exploit measurement data of both application layer and underlay infrastructure, and our two-step localization algorithm also includes a random walk procedure to model anomaly propagation probability (Section V-C).
  1. Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs (FSE 2019)

Summary

  1. WeBank project - AIOps + KG [Link]
  2. Paper Summary (Chinese) [Link]
  3. Meituan project [Link]