A curated list of awesome academic research and industrial materials on Artificial Intelligence for IT Operations (AIOps).
| China (& HK SAR) | | | |
|---|---|---|---|
| Michael R. Lyu, CUHK | Dongmei Zhang, Microsoft | Pengfei Chen, SYSU | Dan Pei, Tsinghua |
| Xin Peng, Fudan | | | |
| USA | | | |
| Ryan Huang, JHU | Yingnong Dang, Microsoft | | |
| Europe | | | |
| Odej Kao, TU Berlin | | | |
| Australia | | | |
| Hongyu Zhang, UON | | | |
- [AIOps Challenge] A series of AIOps competitions hosted by Tsinghua University
- [PAKDD2020] Alibaba AIOps Competition
- [VMware] Proactive Incident and Problem Management
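- [GREATOPS Efficient Operations Community (高效运维社区)] White paper: "Enterprise-Level AIOps Implementation Recommendations" (《企业级 AIOps 实施建议》)
- [Awesome Open Source] AIOps Handbook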
- [Moogsoft] What is AIOps?
- [Microsoft] Advancing Azure service quality with artificial intelligence: AIOps
- [Fudan] Train Ticket (A Benchmark Microservice System)
- [Weaveworks] Sock Shop (A Microservices Demo Application)
- [Log Analytics] LogPAI
- [Outlier Detection] PyOD
- [Anomaly Detection] ADTK
- [Anomaly Detection] PySAD
- [Fault Injection] Chaos Mesh
- [Fault Injection] ChaosBlade
- [Container Monitoring] cAdvisor
- [Performance Monitoring] Netdata
- [Anomaly Detection Labeling Tool] Microsoft TagAnomaly
- [Serverless App Dev. Framework] AWS Serverless Application Model (AWS SAM)
- Datadog: A monitoring and security platform for cloud applications
- BizSeer (必示)
- TINGYUN (听云): An end-to-end, full-platform application performance management system
- Loom Systems
- ICSE21 Workshop on Cloud Intelligence
- AAAI-20 Workshop on Cloud Intelligence
- AIOPS 2020 (International Workshop on Artificial Intelligence for IT Operations)
- [arXiv'21] Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection
- [CSUR'21] A Survey on Automated Log Analysis for Reliability Engineering
- [ESEC/FSE'20] Towards intelligent incident management: why we need it and how we make it
- [arXiv'20] A Systematic Mapping Study in AIOps
- [ICSE'19] AIOps: Real-World Challenges and Research Innovations
- [ISSRE'16] Experience Report: System Log Analysis for Anomaly Detection
- [ASE'13] Software analytics for incident management of online services: An experience report
- [ASE'21] AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems
- [NSDI'07] X-Trace: A Pervasive Network Tracing Framework
- [HotNets'06] Discovering Dependencies for Network Management
- [ICSE'22] Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching
- [KDD'19] Time-Series Anomaly Detection Service at Microsoft
- [OSDI'18] Capturing and Enhancing In Situ System Observability for Failure Detection
- [ESEC/FSE'18] Identifying Impactful Service System Problems via Log Analysis
- [CCS'17] DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
- [OSDI'20] Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
- [ESEC/FSE'18] Predicting Node Failure in Cloud Service Systems
- [ASPLOS'21] Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices
- [OSDI'20] FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices
- [NSDI'20] Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure
- [ESEC/FSE'20] Graph-based trace analysis for microservice architecture understanding and problem diagnosis
- [TSE'18] Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study
- [ASE'21] Graph-based Incident Aggregation for Large-Scale Online Service Systems
- [ASE'20] How Incidental are the Incidents?: Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems
- [ESEC/FSE'20] Identifying linked incidents in large-scale online service systems
- [ESEC/FSE'20] Efficient incident identification from multi-dimensional issue reports via meta-heuristic search
- [ESEC/FSE'20] Real-time incident prediction for online service systems
- [ESEC/FSE'20] How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems
- [ICSE'20] Understanding and Handling Alert Storm for Online Service Systems
- [HotOS'19] What bugs cause production cloud incidents?
- [ASE'19] Continuous Incident Triage for Large-Scale Online Service Systems
- [ICSE'19] An empirical investigation of incident triage for online service systems
- [WWW'19] Outage Prediction and Diagnosis for Cloud Service Systems
- [KDD'14] Correlating Events with Time Series for Incident Diagnosis
- [CUHK] Loghub
- [Microsoft Azure] Azure Public Dataset
- [Tsinghua] AIOps Challenge Dataset
- [Google] Cluster Traces
- [Backblaze] Hard Drive Dataset
- [Baidu] SMART Dataset of PAKDD CUP 2020
- [Coursera] Cloud-Based Network Design & Management Techniques
- [Tsinghua] AIOps Course of Tsinghua