awesome-AIOps
A curated list of awesome academic research and industrial materials on Artificial Intelligence for IT Operations (AIOps).
Researchers
China (& HK SAR) | | | |
---|---|---|---|
Michael R. Lyu, CUHK | Dongmei Zhang, Microsoft | Pengfei Chen, SYSU | Dan Pei, Tsinghua |
Xin Peng, Fudan | | | |
USA | | | |
Ryan Huang, JHU | Yingnong Dang, Microsoft | Christina Delimitrou, MIT EECS | |
Europe | | | |
Odej Kao, TU Berlin | | | |
Australia | | | |
Hongyu Zhang, UON | | | |
Industrial Materials
Competitions
- [AIOps Challenge] A series of AIOps competitions hosted by Tsinghua University
- [PAKDD2020] Alibaba AIOps Competition
White Papers
- [VMware] Proactive Incident and Problem Management
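- [GREATOPS (高效运维社区, Efficient Operations Community)] White paper: "Enterprise-Level AIOps Implementation Recommendations" (《企业级 AIOps 实施建议》)
- [Awesome Open Source] AIOps Handbook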
Blogs & Tutorials & Magazines
- [Moogsoft] What is AIOps?
- [Microsoft] Advancing Azure service quality with artificial intelligence: AIOps
Benchmarks
- [Fudan] Train Ticket (A Benchmark Microservice System)
- [Weaveworks] Sock Shop (A Microservices Demo Application)
Tools
- [Log Analytics] LogPAI
- [AI for Cloud Operation] OpsPAI
- [Outlier Detection] PyOD
- [Anomaly Detection] ADTK
- [Anomaly Detection] PySAD
- [Fault Injection] Chaos Mesh
- [Fault Injection] ChaosBlade
- [Container Monitoring] cAdvisor
- [Performance Monitoring] Netdata
- [Anomaly Detection Labeling Tool] Microsoft TagAnomaly
- [Serverless App Dev. Framework] AWS Serverless Application Model (AWS SAM)
Companies
- Datadog: A monitoring and security platform for cloud applications
- BizSeer (必示)
- TINGYUN (听云): An end-to-end, full-platform application performance management system
- Loom Systems
Academic Materials
Talks
Workshops
- ICSE21 Workshop on Cloud Intelligence
- AAAI-20 Workshop on Cloud Intelligence
- AIOPS 2020 (International Workshop on Artificial Intelligence for IT Operations)
Papers
Survey & Empirical Study
- [arXiv '21] Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection
- [CSUR '21] A Survey on Automated Log Analysis for Reliability Engineering
- [ESEC/FSE '20] Towards intelligent incident management: why we need it and how we make it
- [arXiv '20] A Systematic Mapping Study in AIOps
- [ICSE '19] AIOps: Real-World Challenges and Research Innovations
- [ISSRE '16] Experience Report: System Log Analysis for Anomaly Detection
- [ASE '13] Software analytics for incident management of online services: An experience report
Benchmarks
- [arXiv '22] Constructing Large-Scale Real-World Benchmark Datasets for AIOps
- [ASPLOS '19] An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems
Knowledge Graph for AIOps
- [ICSE-SEIP '22] Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
- [ICSE-SEIP '21] Neural knowledge extraction from cloud service incidents
- [arXiv '21] SoftNER: Mining Knowledge Graphs From Cloud Incidents
- [APPLSCI '20] A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications
Microservices and Serverless
- [ASPLOS '21] Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices
- [ICDCS '21] Defuse: A Dependency-Guided Function Scheduler to Mitigate Cold Starts on FaaS Platforms
- [FSE '20] Graph-based trace analysis for microservice architecture understanding and problem diagnosis
- [OSDI '20] FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices
- [ESEC/FSE '19] Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs
- [TSE '18] Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study
Dependency and Tracing
- [ASE '21] AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems [code]
- [NSDI '07] X-Trace: A Pervasive Network Tracing Framework
- [HotNets '06] Discovering Dependencies for Network Management
Anomaly and Failure Detection
- [ICSE '22] Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching [code]
- [KDD '19] Time-Series Anomaly Detection Service at Microsoft
- [OSDI '18] Capturing and Enhancing In Situ System Observability for Failure Detection
- [ESEC/FSE '18] Identifying Impactful Service System Problems via Log Analysis
- [CCS '17] DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
Incident and Alarm Management
- [DSN '22] Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems
- [USENIX ATC '21] Fighting the Fog of War: Automated Incident Detection for Cloud Systems
- [ASE '21] Graph-based Incident Aggregation for Large-Scale Online Service Systems
- [ASE '20] How Incidental are the Incidents?: Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems
- [ESEC/FSE '20] Identifying linked incidents in large-scale online service systems
- [ESEC/FSE '20] Efficient incident identification from multi-dimensional issue reports via meta-heuristic search
- [ESEC/FSE '20] Real-time incident prediction for online service systems
- [ESEC/FSE '20] How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems
- [ICSE '20] Understanding and Handling Alert Storm for Online Service Systems
- [HotOS '19] What bugs cause production cloud incidents?
- [ASE '19] Continuous Incident Triage for Large-Scale Online Service Systems
- [ICSE '19] An empirical investigation of incident triage for online service systems
- [WWW '19] Outage Prediction and Diagnosis for Cloud Service Systems
- [KDD '14] Correlating Events with Time Series for Incident Diagnosis
Node, Disk, and Storage
- [TOSEM '20] Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution
- [ICDCS '20] Toward Adaptive Disk Failure Prediction via Stream Mining
- [USENIX ATC '19] IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
- [ESEC/FSE '18] Predicting Node Failure in Cloud Service Systems
- [USENIX ATC '18] Improving Service Availability of Cloud Systems by Predicting Disk Error
VM Analysis and Management
- [NSDI '22] CloudCluster: Unearthing the Functional Structure of a Cloud Service
- [OSDI '20] Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
Deployment
- [SOSP '21] Understanding and Detecting Software Upgrade Failures in Distributed Systems
- [NSDI '20] Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure
Datasets
- [CUHK] Loghub
- [Microsoft Azure] Azure Public Dataset
- [Tsinghua] AIOps Challenge Dataset
- [Google] Cluster Traces
- [Backblaze] Hard Drive Dataset
- [Baidu] SMART Dataset of PAKDD CUP 2020
Others
Courses
- [Coursera] Cloud-Based Network Design & Management Techniques
- [Tsinghua] AIOps Course of Tsinghua