awesome-AIOps

A curated list of awesome academic research and industrial materials on Artificial Intelligence for IT Operations (AIOps).

Researchers
Industrial Materials
Academic Materials
- Talks
- Workshops
Papers
Datasets
Others
- Courses

Researchers

China (& HK SAR)
Michael R. Lyu, CUHK
Dongmei Zhang, Microsoft
Pengfei Chen, SYSU
Dan Pei, Tsinghua
Xin Peng, Fudan
USA
Ryan Huang, JHU
Yingnong Dang, Microsoft
Christina Delimitrou, MIT EECS
Europe
Odej Kao, TU Berlin
Australia
Hongyu Zhang, UON

Industrial Materials

Tools

[Log Analytics] LogPAI
[AI for Cloud Operation] OpsPAI
[Outlier Detection] PyOD
[Anomaly Detection] ADTK
[Anomaly Detection] PySAD
[Fault Injection] Chaos Mesh
[Fault Injection] ChaosBlade
[Container Monitoring] cAdvisor
[Performance Monitoring] Netdata
[Anomaly Detection Labeling Tool] Microsoft TagAnomaly
[Serverless App Dev. Framework] AWS Serverless Application Model (AWS SAM)

Companies

Datadog: A monitoring and security platform for cloud applications
必示 bizseer
听云 TINGYUN: An end-to-end, full-platform application performance management system
Loom Systems

Academic Materials

Talks

[Michael R. Lyu] Reliability-Driven AIOps for Cloud Resilience (Keynote talk at ICSE '21)

Workshops

Papers

Survey & Empirical Study

[arXiv '21] Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection
[CSUR '21] A Survey on Automated Log Analysis for Reliability Engineering
[ESEC/FSE '20] Towards intelligent incident management: why we need it and how we make it
[arXiv '20] A Systematic Mapping Study in AIOps
[ICSE '19] AIOps: Real-World Challenges and Research Innovations
[ISSRE '16] Experience Report: System Log Analysis for Anomaly Detection
[ASE '13] Software analytics for incident management of online services: An experience report

Benchmarks

Knowledge Graph for AIOps

[ICSE-SEIP '22] Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
[ICSE-SEIP '21] Neural knowledge extraction from cloud service incidents
[arXiv '21] SoftNER: Mining Knowledge Graphs From Cloud Incidents
[APPLSCI '20] A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications

Microservices and Serverless

Dependency and Tracing

[ASE '21] AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems [code]
[NSDI '07] X-Trace: A Pervasive Network Tracing Framework
[HotNets '06] Discovering Dependencies for Network Management

Anomaly and Failure Detection

[ICSE '22] Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching [code]
[KDD '19] Time-Series Anomaly Detection Service at Microsoft
[OSDI '18] Capturing and Enhancing In Situ System Observability for Failure Detection
[ESEC/FSE '18] Identifying Impactful Service System Problems via Log Analysis
[CCS '17] DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning

Incident and Alarm Management

[DSN '22] Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems
[USENIX ATC '21] Fighting the Fog of War: Automated Incident Detection for Cloud Systems
[ASE '21] Graph-based Incident Aggregation for Large-Scale Online Service Systems
[ASE '20] How Incidental are the Incidents?: Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems
[ESEC/FSE '20] Identifying linked incidents in large-scale online service systems
[ESEC/FSE '20] Efficient incident identification from multi-dimensional issue reports via meta-heuristic search
[ESEC/FSE '20] Real-time incident prediction for online service systems
[ESEC/FSE '20] How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems
[ICSE '20] Understanding and Handling Alert Storm for Online Service Systems
[HotOS '19] What bugs cause production cloud incidents?
[ASE '19] Continuous Incident Triage for Large-Scale Online Service Systems
[ICSE '19] An empirical investigation of incident triage for online service systems
[WWW '19] Outage Prediction and Diagnosis for Cloud Service Systems
[KDD '14] Correlating Events with Time Series for Incident Diagnosis

Node, Disk, and Storage

[TOSEM '20] Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution
[ICDCS '20] Toward Adaptive Disk Failure Prediction via Stream Mining
[USENIX ATC '19] IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
[ESEC/FSE '18] Predicting Node Failure in Cloud Service Systems
[USENIX ATC '18] Improving Service Availability of Cloud Systems by Predicting Disk Error

VM Analysis and Management

Deployment

Datasets

[CUHK] Loghub
[Microsoft Azure] Azure Public Dataset
[Tsinghua] AIOps Challenge Dataset
[Google] Cluster Traces
[Backblaze] Hard Drive Dataset
[Baidu] SMART Dataset of PAKDD CUP 2020

Others

Courses

[Coursera] Cloud-Based Network Design & Management Techniques
[Tsinghua] AIOps Course of Tsinghua
