Awesome Chaos Engineering

Testing in production (TiP) is gaining steam as an accepted practice in DevOps and testing communities, because no amount of preproduction QA testing can foresee all the possible scenarios in your real production deployment.

The prevailing wisdom is that you will see failures in production; the only question is whether you'll be surprised by them or inflict them intentionally to test system resilience and learn from the experience.

The latter approach is chaos engineering.

To get the most out of these resources, it helps to have a good background in chaos engineering, containers, fault injection, monitoring, and observability.

0. Introduction
1. Chaos in Practice
2. Principles of Chaos Engineering
3. Fault Injection
4. Observability
5. Incident Management Tool
6. Cost of SEVs
7. Chaos As A Service
8. Gamedays
9. Forums and Groups
10. References
11. License
12. Contributing

0. Introduction

Chaos engineering is defined as "the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production" (Principles of Chaos Engineering, http://principlesofchaos.org/).

In other words, it's a software testing method focusing on finding evidence of problems before they are experienced by users.

It's a common misconception that chaos engineering is only about randomly breaking things in production. It's not. Although running experiments in production is a unique part of chaos engineering (more on that later), it's about much more than that—anything that helps us be confident the system can withstand turbulence.

IMPORTANT!: Chaos engineering is not just about randomly breaking things ;-)

It interfaces with site reliability engineering (SRE), application and systems performance analysis, and other forms of testing.

Practicing chaos engineering can help you prepare for failure, and by doing that, learn to build better systems, improve existing ones, and make the world a safer place.

Motivations for chaos engineering

There are at least three good reasons to implement chaos engineering:

Determining risk and cost and setting service-level indicators, objectives, and agreements
Testing a system (often complex and distributed) as a whole
Finding emergent properties you were unaware of

1. Chaos in Practice

To specifically address the uncertainty of distributed systems at scale, Chaos Engineering can be thought of as the facilitation of experiments to uncover systemic weaknesses. These experiments follow four steps:

Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
Hypothesize that this steady state will continue in both the control group and the experimental group.
Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

The harder it is to disrupt the steady state, the more confidence we have in the behavior of the system. If a weakness is uncovered, we now have a target for improvement before that behavior manifests in the system at large.
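The four steps above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not a real experiment harness: the steady-state metric is taken to be the request success rate, and `make_service`, `run_experiment`, and the `tolerance` threshold are hypothetical names introduced here for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    control_rate: float
    experimental_rate: float
    hypothesis_holds: bool


def make_service(failure_probability: float, rng: random.Random,
                 retries: int = 0) -> Callable[[], bool]:
    """Hypothetical service stub: each attempt fails with the given probability."""
    def handle() -> bool:
        # A request succeeds if at least one attempt avoids the injected failure.
        return any(rng.random() >= failure_probability
                   for _ in range(retries + 1))
    return handle


def success_rate(handler: Callable[[], bool], requests: int) -> float:
    """Step 1: steady state, measured here as the fraction of successful requests."""
    return sum(1 for _ in range(requests) if handler()) / requests


def run_experiment(control: Callable[[], bool],
                   experimental: Callable[[], bool],
                   requests: int = 1000,
                   tolerance: float = 0.01) -> ExperimentResult:
    """Steps 2 and 4: assume both groups stay in steady state, then compare them."""
    control_rate = success_rate(control, requests)
    experimental_rate = success_rate(experimental, requests)
    holds = abs(control_rate - experimental_rate) <= tolerance
    return ExperimentResult(control_rate, experimental_rate, holds)


rng = random.Random(42)
control = make_service(0.0, rng)        # no fault injected
experimental = make_service(0.2, rng)   # step 3: 20% of attempts fail
print(run_experiment(control, experimental))
```

In this sketch the experimental group has no retries, so the injected failures show up directly in the success rate and the hypothesis is disproved. Passing `retries=5` to the experimental service models a more resilient design, for which the two groups become much harder to tell apart.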

2. Principles of Chaos Engineering

The Principles of Chaos Engineering set out five principles for designing a chaos experiment:

Build a Hypothesis around Steady State Behavior
Vary Real-world Events
Run Experiments in Production
Automate Experiments to Run Continuously
Minimize Blast Radius
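As an illustration of the last principle, one way to limit blast radius is to experiment on a small, capped subset of instances and to stop as soon as the error rate moves too far from its baseline. The sketch below assumes exactly that approach; `pick_targets`, `should_abort`, and their parameters are hypothetical names, not part of any tool listed in this document.

```python
import math
import random
from typing import Optional, Sequence


def pick_targets(instances: Sequence[str], fraction: float,
                 rng: random.Random,
                 max_targets: Optional[int] = None) -> list:
    """Choose a bounded random subset of instances to experiment on."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not instances:
        return []
    # Always target at least one instance, but never more than the cap.
    count = max(1, math.floor(len(instances) * fraction))
    if max_targets is not None:
        count = min(count, max_targets)
    return rng.sample(list(instances), count)


def should_abort(error_rate: float, baseline: float,
                 max_increase: float) -> bool:
    """Stop the experiment when errors rise too far above the baseline."""
    return error_rate - baseline > max_increase


fleet = [f"instance-{i}" for i in range(20)]
targets = pick_targets(fleet, fraction=0.1, rng=random.Random(7))
print(targets)
```

Starting with a small `fraction` and raising it only after earlier runs have passed keeps the impact of an unexpected weakness contained.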

3. Fault Injection

Generic Tools

The Simian Army - A suite of tools for keeping your cloud operating in top form.
Chaos Monkey - A resiliency tool that helps applications tolerate random instance failures.
Chaos Toolkit - A chaos engineering toolkit to help you build confidence in your software system.
Chaos Toolkit Turbulence - This is an extension for Chaos Toolkit which adds support for Turbulence attacks.
Monarch - This is a series of tools for Chaos Toolkit.
Muxy - A chaos testing tool for simulating real-world distributed system failures.
Chaos Blade - Chaosblade is an experimental tool that follows the principles of Chaos Engineering and is used to simulate common fault scenarios, helping to improve the recoverability and fault tolerance of systems.
Cthulhu - Chaos Engineering tool that helps evaluate the resiliency of microservice systems by simulating various disaster scenarios against a target infrastructure in a data-driven manner.
Namazu - Programmable fuzzy scheduler for testing distributed systems.
Chaos Scimmia - Chaos Engineering for Redis.
HavocLeopard - A set of simple chaos engineering apps that can be used to royally screw up your on-prem servers.
Arcdata - Open source incident management and volunteer scheduling application for Red Cross Disaster Services.
AWS Chaos Scripts - Collection of Python scripts to run failure injection on AWS infrastructure.
Simoorg - Simoorg is LinkedIn's own failure inducer framework. It was designed to be easy to extend, and most of the important components are pluggable.

CPU

Cpu Troll - Dedicated to raising CPU latency by the requested percentage and timespan.

Memory

totalChaos - Overloads RAM and moves open windows around the screen; if the user presses CTRL+ALT+DEL, it opens an endless series of command prompts.

File system

Disk

Networking

Toxiproxy - A TCP proxy to simulate network and system conditions for chaos and resiliency testing.
Comcast - A tool designed to simulate common network problems like latency, bandwidth restrictions, and dropped/reordered/corrupted packets.
Chaos HTTP Proxy - Introduce failures into HTTP requests via a proxy server.

Security

Infection Monkey - Open source security tool for testing a data center's resiliency to perimeter breaches and internal server infection. The Monkey uses various methods to self-propagate across a data center and reports success to a centralized Monkey Island server.
ChaoSlingr - Introducing Security Chaos Engineering. ChaoSlingr focuses primarily on experimentation on AWS infrastructure to proactively instrument system security failures.
Mitigant - Security chaos engineering for cloud cyber resilience.

Languages

Compilation time

ChaosCat - Chaos engineering for Pull Requests - Taking a not-even-good joke a bit too far.

Runtime

Byteman - A Swiss Army Knife for Byte Code Manipulation.
Byte-Monkey - Bytecode-level fault injection for the JVM. It works by instrumenting application code on the fly to deliberately introduce faults like exceptions and latency.
Perses - A project to cause (controlled) destruction to a JVM application.
Wiremock - API mocking (Service Virtualization) which enables modeling real world faults and delays.
MockLab - API mocking (Service Virtualization) as a service which enables modeling real world faults and delays.
Flaw - Inject failures on api calls for local chaos engineering.
Havoc - Havoc is a collection of dangerous code that wreaks havoc in .NET applications and the operating system for chaos engineering.
Utilities for frontend chaos engineering - Utilities for frontend chaos engineering.
CHAOS GOPHER - A collection of Unix-style tools in Go to do chaos engineering or testing.
Chaos Monkey for Spring Boot - Injects latencies, exceptions, and terminations into Spring Boot applications.
React Chaos - Chaos Engineering for your React apps.
Vue Chaos - A simple (yet chaotic) component to introduce chaos in your Vue app.
Chaos QoaLa - ChaosQoaLa is a chaos engineering tool for injecting failure into JavaScript backed GraphQL end points.
Chaos Reverse-engineering - Chaos engineering approach by Reverse-engineering.
Fault - The fault package provides Go HTTP middleware that makes it easy to inject faults into your service.
GORM SQLChaos - GORM SQLChaos manipulates DML at program runtime based on GORM callbacks.
Chaos Frontend Toolkit - A set of tools to break your web apps and, in doing so, find ways to improve them.
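Several of the runtime tools above work by wrapping application code so that calls occasionally fail or slow down. The following is a minimal Python sketch of that general idea, not the API of any listed tool: `inject_faults`, `InjectedFault`, and the `enabled` switch are hypothetical names introduced here.

```python
import functools
import random
import time
from typing import Callable, Optional


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def inject_faults(failure_rate: float = 0.0,
                  latency_seconds: float = 0.0,
                  rng: Optional[random.Random] = None,
                  enabled: Callable[[], bool] = lambda: True):
    """Decorator that adds latency and random exceptions to a function."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The enabled() check lets an experiment be switched off at runtime.
            if enabled():
                if latency_seconds > 0:
                    time.sleep(latency_seconds)
                if rng.random() < failure_rate:
                    raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.3, latency_seconds=0.01)
def fetch_profile(user_id: int) -> dict:
    """Stand-in for a call to a downstream dependency."""
    return {"id": user_id}


try:
    print(fetch_profile(1))
except InjectedFault as exc:
    print(f"caller must handle: {exc}")
```

Keeping the fault behind a decorator with an explicit `enabled` switch means the application code itself stays unchanged and the experiment can be turned off without a redeploy.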

Database

RedFI - RedFI acts as a proxy between the client and Redis with the capability of injecting faults on the fly, based on the rules given by you.

Virtual Machine

ChaosMachine - Tool to do chaos engineering at the application level in the JVM.
TripleAgent - System for fault injection for Java applications.

Containers & Orchestrators

ChaosOrca - Tool for doing Chaos Engineering on containers by perturbing system calls for processes inside containers.
POBS - Automatic Observability and Chaos for Dockerized Java Applications.
Pumba - Chaos testing and network emulation for Docker containers (and clusters).
Blockade - Docker-based utility for testing network failures and partitions in distributed applications.
Chaos Engineering for Docker - Chaos Engineering for Docker.
Chaos Engineering with Docker EE - Chaos Engineering with Docker EE.
Chaos Util - Docker image with utilities for Chaos Engineering.
Drax - DC/OS Resilience Automated Xenodiagnosis tool. It helps to test DC/OS deployments by applying a Chaos Monkey-inspired, proactive and invasive testing approach.
Pod-Reaper - A rules based pod killing container. Pod-Reaper was designed to kill pods that meet specific conditions that can be used for Chaos testing in Kubernetes.
Chaoskube - ChaosKube periodically kills random pods in your Kubernetes cluster.
Litmus - Framework for Kubernetes environments that enables users to run test suites, capture logs, generate reports and perform chaos tests.
Chaos Operator - Chaos engineering via Kubernetes operator.
Kube Entropy - A little chaos engineering application for Kubernetes resilience testing.
kubernetes-chaos-lab - A brief guide to setting up your first chaos engineering lab on Kubernetes.
Chaos Mesh - A Chaos Engineering Platform for Kubernetes.

Hypervisors

VMware Mangle - Orchestrating Chaos Engineering.
Turbulence - Tool focused on BOSH environments capable of stressing VMs, manipulating network traffic, and more. It is very similar to Gremlin.
Chaos Lemur - This project is a self-hostable application to randomly destroy virtual machines in a BOSH-managed environment.

Kernel & Operating System

Cloud

Chaos Engine - Chaos Engine is an application for creating random Chaos Events in cloud applications to test resiliency.

Private Cloud

Glooshot - Chaos engineering framework to help you immunize your service mesh.
kube-monkey - An implementation of Netflix's Chaos Monkey for Kubernetes clusters.
Powerful Seal - PowerfulSeal adds chaos to your Kubernetes clusters, so that you can detect problems in your systems as early as possible. It kills targeted pods and takes VMs up and down.
KubeInvaders - Gamified Chaos engineering tool for Kubernetes Clusters.
Kube DOOM - The next level of chaos engineering is here! Kill pods inside your Kubernetes cluster by shooting them in Doom.
GomJabbar - ChaosMonkey for your private cloud.
kubethanos - kubethanos kills half of your pods randomly to engineer chaos in your preferred environment, giving you the opportunity to see how your system behaves under failure.
krkn - Chaos and resiliency testing tool for Kubernetes and OpenShift.
kube-burner - Kube-burner is a Kubernetes performance and scale test orchestration toolset.
Chaos Controller - The Chaos Controller is a Kubernetes controller with which you can inject various systemic failures, at scale, and without caring about the implementation details of your Kubernetes infrastructure.

Amazon AWS

Testing Amazon Aurora Using Fault Injection Queries - Testing Amazon Aurora Using Fault Injection Queries.
Chaos SSM Documents - Collection of AWS SSM Documents to perform Chaos Engineering experiments.
failure-lambda - failure-lambda is a small Node module for injecting failure into AWS Lambda.
chaos_lambda - chaos_lambda is a small library injecting chaos into AWS Lambda.
AWSSSMChaosRunner - AWSSSMChaosRunner is a library which simplifies failure injection testing and chaos engineering for EC2 and ECS (with EC2 launch type).

Azure Cloud

Azure Fault Analysis Service
Include controlled Chaos in Service Fabric clusters - Include controlled Chaos in Service Fabric clusters.
chaos-dingo - Monkey and Lemur are taken, so Chaos Dingo it is. This is a tool to mess with Azure services using the Azure NodeJS SDK.
Chaos Lambda - Randomly terminate ASG instances during business hours.

Google Cloud Platform

Chaos Engineering on Google Cloud Platform - Chaos Engineering on Google Cloud Platform.

Example Projects

A Chaos Engineering Bootcamp - A Chaos Engineering Bootcamp.
HW4 - Express servers were used to implement service topologies.
Serverless Chaos Engineering Demo - This example demonstrates how to use Adrian Hornsby's Failure Injection Layer to perform chaos engineering experiments on a serverless environment.
Chaos Engineering Demo - Simple project demonstrating chaos engineering with Chaos Monkey and Resilience4j.
Chaos Engineering Demo - Sample application combining Resilience4j, Chaos Toolkit, WireMock, and Chaos Monkey for Spring Boot.
How to Create a Kubernetes Cluster on Ubuntu 16.04 with kubeadm and Weave Net

4. Observability

Specific tools

General Use

My Awesome Observability Repo ;-)

5. Incident Management Tool

Banjaxed - Open source incident management tool.

6. Cost of SEVs

Availability Calculator - Calculate how much downtime should be permitted in your SLA.
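The arithmetic behind such a calculator is simple: the permitted downtime is the length of the period multiplied by one minus the availability target. Here is a minimal sketch, assuming a 30-day month and a 365-day year; `allowed_downtime` and `PERIODS` are hypothetical names and do not reflect how the linked calculator is implemented.

```python
from datetime import timedelta

# Assumed period lengths; real SLAs may define months and years differently.
PERIODS = {
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def allowed_downtime(availability_percent: float,
                     period: timedelta) -> timedelta:
    """Downtime budget for a given availability target over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(target, allowed_downtime(target, PERIODS["year"]))
```

For example, a 99.9% target over a 365-day year leaves a budget of about 8.76 hours, and each additional nine cuts that budget by a factor of ten.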

7. Chaos As A Service

Gremlin Inc. - Failure as a Service.
Chaos Engineering Experiment Automation - Chaos Engineering Experiment Automation.
Pystol.org - The cloud chaos engineering toolbox.
Chaos Platform - Chaos Engineering Platform for Everyone.
Chaos Hub - Chaos Hub stands on the shoulders of the Chaos Toolkit to provide a complete, user-friendly, platform to automate and collaborate on your Chaos Engineering and Resiliency efforts.
steadybit - Chaos Engineering platform that helps to proactively reduce downtime and provide visibility into systems to detect issues.
Cavisson - Chaos engineering platform.

8. Gamedays

Target: What is a Gameday? - Chaos Gamedays experience by Target.
Codecentric: Chaos Engineering Gamedays - Chaos Gamedays by Codecentric.
New Relic: How to run a Gameday? - Chaos Gamedays experience by New Relic.
Dius: Gamedays resources - Resources for getting started with GameDay and Chaos Engineering.
Gremlin: Gamedays - Resources for getting started with GameDay and Chaos Engineering.
Gremlin: Planning your own Chaos Day - Example of a Gameday with DynamoDB by Gremlin.
Gremlin: How to run a Gameday? - Methodology to run Gamedays according to Gremlin.
Gremlin DB: Breaking Dynamo DB - Example of a Gameday with DynamoDB by Gremlin.
Gremlin: Introduction to Gameday - What is a Gameday according to Gremlin.
Gremlin: Inside Gremlin 2019 Gremlin Gamedays Roadmap - Chaos Gamedays experience by Gremlin.
Gremlin: What I learned running the Chaos Lab with Kafka - Example of a Gameday with Kafka by Gremlin.
Chaos Toolkit: Chaos Engineering with Humans in the loop - Article about Chaos Gamedays.
GoCardless: All fun and games until you start with Gamedays - Article about Chaos Gamedays.
InfoQ: Gamedays - Achieving Resilience through Chaos Engineering - InfoQ Presentation with experiences about Chaos Gamedays.

9. Forums and Groups

CNCF Chaos Engineering Working Group
CNCF Chaos Engineering Working Group Slack: #chaosengineering (slack.cncf.io)
CNCF Chaos Engineering Working Group GitHub
Chaos Toolkit Slack Community

10. References

11. License

12. Contributing

Contributions welcome! Read the contribution guidelines first.

Thank you!

adriannovegil/awesome-chaos-engineering