1. Single-Metric Anomaly Detection

| Papers | Tags | Links |
| --- | --- | --- |
| Cross-dataset Time Series Anomaly Detection for Cloud Systems | ATC 2019 | Paper |
| Time-Series Anomaly Detection Service at Microsoft | KDD 2019, Spectral Residual | Paper |
| Robust KPI Anomaly Detection for Large-Scale Software Services with Partial Labels | ISSRE 2021, PU learning, active learning | Paper |
| Efficient KPI Anomaly Detection Through Transfer Learning for Large-Scale Web Services | JSAC 2022, transfer learning, time series clustering | Paper |

2. Multi-Metric Anomaly Detection

| Papers | Tags | Links |
| --- | --- | --- |
| Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network | KDD 2019, neural networks, Bayesian network models | Paper |
| Detecting Outlier Machine Instances through Gaussian Mixture Variational Autoencoder with One Dimensional CNN | TC 2021, 1D-CNN, GMVAE | Paper |
| Multivariate Time Series Anomaly Detection and Interpretation using Hierarchical Inter-Metric and Temporal Embedding | KDD 2021, neural networks, Bayesian network models, latent variable models | Paper |
| Jump-Starting Multivariate Time Series Anomaly Detection for Online Service Systems | ATC 2021, transfer learning, time series clustering | Paper |

3. Log Analysis and Anomaly Detection

| Papers | Tags | Links |
| --- | --- | --- |
| LogAnomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs | IJCAI 2019 | Paper |
| A Survey on Automated Log Analysis for Reliability Engineering | CSUR 2021, log compression, log parsing, log mining | Paper |
| Log-based Anomaly Detection Without Log Parsing | ASE 2021, log parsing, deep learning | Paper |
| MoniLog: An Automated Log-Based Anomaly Detection System for Cloud Computing Infrastructures | ICDE 2021, log instability, log parsing | Paper |
| Log-based Anomaly Detection with Deep Learning: How Far Are We? | ICSE 2022, log parsing, deep learning | Paper |

4. Trace Analysis and Anomaly Detection

| Papers | Tags | Links |
| --- | --- | --- |
| Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs | ESEC/FSE 2019 | Paper |
| Unsupervised Detection of Microservice Trace Anomalies through Service-Level Deep Bayesian Networks | ISSRE 2020 | Paper |
| Practical Root Cause Localization for Microservice Systems via Trace Analysis | IWQoS 2021 | Paper |
| TraceCRL: Contrastive Representation Learning for Microservice Trace Analysis | FSE 2022 | |
| Unsupervised Anomaly Detection on Microservice Traces through Graph VAE | WWW 2023 | Paper |

5. Failure Classification

| Papers | Tags | Links |
| --- | --- | --- |
| Fingerprinting the Datacenter: Automated Classification of Performance Crises | EuroSys 2010 | Paper |
| Taking the Blame Game out of Data Centers Operations with NetPoirot | SIGCOMM 2016 | Paper |
| Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning | TPDS 2019 | Paper |
| Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems | HPCC 2021 | Paper |
| Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems | ISSRE 2021 | Paper |
| Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases | VLDB 2020 | Paper |

6. Root Cause Localization

| Papers | Tags | Links |
| --- | --- | --- |
| AutoMAP: Diagnose Your Microservice-based Web Applications Automatically | WWW 2020, anomaly diagnosis | Paper |
| MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments | WWW 2021, PageRank, spectrum analysis | Paper |
| MicroHECL: High-Efficient Root Cause Localization in Large-Scale Microservice Systems | ICSE 2021, root cause localization, service call graph | Paper |
| Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems | ESEC/FSE 2022, fault localization, online service systems, recurring failures | Paper |
| Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data | ICSE 2023, root cause localization | Paper |
| Robust Failure Diagnosis of Microservice System through Multimodal Data | TSC 2023, failure diagnosis, multimodal data, graph neural network | Paper |