awesome-AIOps

A curated list of awesome academic research and industrial materials on Artificial Intelligence for IT Operations (AIOps).

Researchers
Industrial Materials
Academic Materials
- Talks
- Workshops
Papers
Datasets
Others
- Courses

Researchers

China (& HK SAR)
Michael R. Lyu, CUHK
Dongmei Zhang, Microsoft
Pengfei Chen, SYSU
Dan Pei, Tsinghua
Xin Peng, Fudan
USA
Ryan Huang, JHU
Yingnong Dang, Microsoft
Christina Delimitrou, MIT EECS
Europe
Odej Kao, TU Berlin
Australia
Hongyu Zhang, UON

Industrial Materials

Tools

[Log Analytics] LogPAI
[AI for Cloud Operation] OpsPAI
[Outlier Detection] PyOD
[Anomaly Detection] ADTK
[Anomaly Detection] PySAD
[Online Machine Learning] River
[Online Machine Learning] scikit-multiflow
[Fault Injection] Chaos Mesh
[Fault Injection] ChaosBlade
[Container Monitoring] cAdvisor
[Performance Monitoring] Netdata
[Anomaly Detection Labeling Tool] Microsoft TagAnomaly
[Serverless App Dev. Framework] AWS Serverless Application Model (AWS SAM)
[Performance Testing Tool] Locust

Companies

Datadog: A monitoring and security platform for cloud applications
BizSeer (必示)
Bonree (博睿数据)
TINGYUN (听云): An end-to-end, full-platform application performance management system
Loom Systems

Academic Materials

Talks

[Michael R. Lyu] Reliability-Driven AIOps for Cloud Resilience (Keynote talk at ICSE '21)

Workshops

Papers

Survey & Empirical Study

[arXiv '23] AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges
[CSUR '22] Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey
[ASE '22] Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling
[arXiv '21] Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection
[CSUR '21] A Survey on Automated Log Analysis for Reliability Engineering
[ESEC/FSE '20] Towards intelligent incident management: why we need it and how we make it
[arXiv '20] A Systematic Mapping Study in AIOps
[ICSE '19] AIOps: Real-World Challenges and Research Innovations
[HotOS '19] What bugs cause production cloud incidents?
[ISSRE '16] Experience Report: System Log Analysis for Anomaly Detection
[ASE '13] Software analytics for incident management of online services: An experience report

Benchmarks

(Large) Language Models for IT Operations

Knowledge Graph for AIOps

[ICSE-SEIP '22] Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
[ICSE-SEIP '21] Neural knowledge extraction from cloud service incidents
[arXiv '21] SoftNER: Mining Knowledge Graphs From Cloud Incidents
[APPLSCI '20] A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications

Microservices and Serverless

Dependency and Tracing

[ASE '21] AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems [code]
[NSDI '07] X-Trace: A Pervasive Network Tracing Framework
[HotNets '06] Discovering Dependencies for Network Management

Detection and Localization of Anomaly/Failure

[ICSE '23] CONAN: Diagnosing Batch Failures for Cloud Systems
[ISSRE '22] Share or Not Share? Towards the Practicability of Deep Models for Unsupervised Anomaly Detection in Modern Online Systems [code]
[ICSE '22] Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching [code]
[KDD '19] Time-Series Anomaly Detection Service at Microsoft
[OSDI '18] Capturing and Enhancing In Situ System Observability for Failure Detection
[ESEC/FSE '18] Identifying Impactful Service System Problems via Log Analysis
[CCS '17] DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning

Incident and Alarm Management

[USENIX ATC '23] AutoARTS: Taxonomy, Insights and Tools for Root Cause Labelling of Incidents in Microsoft Azure
[ICSE '23] Incident-aware Duplicate Ticket Aggregation for Cloud Systems
[SoCC '22] How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service
[DSN '22] Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems
[USENIX ATC '21] Fighting the Fog of War: Automated Incident Detection for Cloud Systems
[ASE '21] Graph-based Incident Aggregation for Large-Scale Online Service Systems
[ASE '21] Groot: An Event-graph-based Approach for Root Cause Analysis in Industrial Settings
[SIGCOMM '20] Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing
[ASE '20] How Incidental are the Incidents?: Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems
[ESEC/FSE '20] Identifying linked incidents in large-scale online service systems
[ESEC/FSE '20] Efficient incident identification from multi-dimensional issue reports via meta-heuristic search
[ESEC/FSE '20] Real-time incident prediction for online service systems
[ESEC/FSE '20] How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems
[ICSE '20] Understanding and Handling Alert Storm for Online Service Systems
[HotOS '19] What bugs cause production cloud incidents?
[ASE '19] Continuous Incident Triage for Large-Scale Online Service Systems
[ICSE '19] An empirical investigation of incident triage for online service systems
[WWW '19] Outage Prediction and Diagnosis for Cloud Service Systems
[KDD '14] Correlating Events with Time Series for Incident Diagnosis

Node, Disk, and Storage

[FAST '23] Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems [data]
[DSN '21] General Feature Selection for Failure Prediction in Large-scale SSD Deployment
[TOSEM '20] Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution
[ICDCS '20] Toward Adaptive Disk Failure Prediction via Stream Mining
[VLDB '20] Diagnosing root causes of intermittent slow queries in cloud databases
[USENIX ATC '19] IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
[NSDI '18] Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure
[ESEC/FSE '18] Predicting Node Failure in Cloud Service Systems
[USENIX ATC '18] Improving Service Availability of Cloud Systems by Predicting Disk Error

VM Analysis and Management

Deployment

Datasets

[CUHK] Loghub
[Microsoft Azure] Azure Public Dataset
[Tsinghua] AIOps Challenge Dataset
[Google] Cluster Traces
[Backblaze] Hard Drive Dataset
[Baidu] SMART Dataset of PAKDD CUP 2020
[Alibaba] SSD SMART logs and failure data
[Alibaba] Alibaba Cluster Trace Program

Others

Courses

[Coursera] Cloud-Based Network Design & Management Techniques
[Tsinghua] AIOps Course of Tsinghua
