awesome-failure-diagnosis

A curated list of resources for research on incident and failure diagnosis.

Intelligent Failure Diagnosis for Microservice Systems

Reading

Anomaly Detection

Root cause analysis / Fault Localization

Other Papers

Misc

Big tech cloud incident status

Sample microservices systems

Dataset

Tools

Metrics

Logs

Traces

Load generators

  • Locust: a load testing tool for web applications. It is used to simulate a large number of users accessing a web application simultaneously, in order to test its performance and scalability.
  • Vegeta: HTTP load testing tool and library. It's over 9000!
  • JMeter: a tool for testing the performance of web applications, databases, and APIs. It can simulate heavy load on a server, group of servers, network, or object to test its strength or to analyze overall performance under different load types. It supports protocols such as HTTP, FTP, TCP, JDBC, and JMS.
  • Stress-ng: a tool that can be used to stress test various aspects of a Linux system, such as the CPU, memory, I/O, and network.
  • wrk2: HTTP workload generator.
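
The tools above share one core idea: issue many concurrent requests and record how the target responds. A minimal sketch of that idea follows, using only the Python standard library. It is not how any of the listed tools is implemented. The names `run_load`, `timed_call`, and `fake_service` are hypothetical, the target is a stand-in that sleeps instead of making a real HTTP call, and the 95th percentile is a rough index-based estimate.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_request):
    """Run one request and return (succeeded, latency_seconds)."""
    start = time.perf_counter()
    try:
        send_request()
        ok = True
    except Exception:
        # Any exception raised by the request counts as a failed request.
        ok = False
    return ok, time.perf_counter() - start


def run_load(send_request, total_requests=100, concurrency=10):
    """Issue total_requests calls from `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, [send_request] * total_requests))
    latencies = sorted(latency for _, latency in results)
    return {
        "requests": len(results),
        "errors": sum(1 for ok, _ in results if not ok),
        "mean_s": statistics.mean(latencies),
        # Rough percentile: pick the element at 95% of the sorted range.
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
        "max_s": latencies[-1],
    }


def fake_service():
    """Hypothetical stand-in for a real request; sleeps about 2 ms."""
    time.sleep(0.002)


if __name__ == "__main__":
    print(run_load(fake_service, total_requests=50, concurrency=5))
```

To point this at a real service, `fake_service` could be replaced by any zero-argument callable that performs one request and raises on failure. For realistic experiments, the dedicated tools listed above remain the better choice, since they handle pacing, ramp-up, and reporting.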

Chaos Engineering / Fault Injection

  • Chaos Mesh: an open-source chaos engineering platform for Kubernetes. It provides a set of APIs and CLI tools that allow users to define and orchestrate chaos experiments, such as network latency injection, pod failure, etc.
  • TC (Traffic Control): the Linux tc utility for configuring kernel packet scheduling; commonly used to delay and drop packets.
  • tc-netem (Network Emulator): an enhancement of the Linux traffic control facilities that allows adding delay, packet loss, duplication, and other characteristics to packets outgoing from a selected network interface. NetEm is built on the existing Quality of Service (QoS) and Differentiated Services (diffserv) facilities in the Linux kernel.
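  • Chaos Monkey: a Netflix tool that randomly terminates virtual machine instances and containers in a production environment to test resilience to instance failures. https://github.com/Netflix/chaosmonkey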
  • ChaosBlade: an open-source chaos engineering and fault injection tool, originally open-sourced by Alibaba. It allows users to simulate various types of failures and network conditions, such as network delays and packet loss, to test the resilience and stability of microservices in a controlled environment.
  • Strace: a diagnostic, debugging and instructional userspace utility for Linux. It is used to monitor and tamper with interactions between processes and the Linux kernel, which include system calls, signal deliveries, and changes of process state.
  • Chaos Toolkit: a CLI tool which helps to run Chaos Engineering experiments.
  • Chaos Genius
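
As an illustration of the network faults mentioned for tc and tc-netem above, the sketch below builds the command lines that add and remove a netem rule. It only constructs the argument lists and does not execute anything; actually applying them requires Linux and root privileges. The helper names `netem_add_command` and `netem_clear_command` and the interface name `eth0` are assumptions made for this example.

```python
def netem_add_command(interface, delay_ms=None, jitter_ms=None, loss_percent=None):
    """Build the argv for adding a netem qdisc; nothing is executed."""
    if delay_ms is None and loss_percent is None:
        raise ValueError("specify delay_ms and/or loss_percent")
    if jitter_ms is not None and delay_ms is None:
        raise ValueError("jitter_ms requires delay_ms")
    argv = ["tc", "qdisc", "add", "dev", interface, "root", "netem"]
    if delay_ms is not None:
        argv += ["delay", f"{delay_ms}ms"]
        if jitter_ms is not None:
            # netem takes the jitter as a second value after the delay.
            argv.append(f"{jitter_ms}ms")
    if loss_percent is not None:
        argv += ["loss", f"{loss_percent}%"]
    return argv


def netem_clear_command(interface):
    """Build the argv that removes the root qdisc, ending the fault."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


if __name__ == "__main__":
    # "eth0" is a placeholder; use the interface of the machine under test.
    print(" ".join(netem_add_command("eth0", delay_ms=100, jitter_ms=10, loss_percent=1)))
    print(" ".join(netem_clear_command("eth0")))
```

Returning argument lists rather than running the commands keeps the sketch safe to try anywhere; on a test machine they could be passed to `subprocess.run` under sudo, with the clear command run afterwards to restore normal networking.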

Academia

Conferences and Journals

  • A*-ranked conferences: ICSE | FSE | ASE | WWW | KDD | NeurIPS
  • A-ranked conferences: ICSME | ICPC | ESEM | RE | MSR | ISSTA | SANER | ICST | ISSRE
  • Top Q1 journal: IEEE TSE

Check Conference Rank | Check Journal Rank | Check Paper Acceptance Status

Researcher

Video

TODO: focus on finding primary source

Others