
Chaos engineering tools for Gardener-managed clusters

Primary language: Python. License: Apache License 2.0 (Apache-2.0).



Introduction

This package provides Gardener-independent chaostoolkit modules to simulate compute and network outages for various cloud providers as well as pod disruptions for any Kubernetes cluster.

Gardener users benefit from an additional module that builds on the generic modules but exposes their functionality in the simplest, most homogeneous, and most secure way (no need to specify cloud provider credentials, cluster credentials, or filters explicitly; it retrieves credentials and keeps them in memory only):

Cloud Providers

Read more on how to simulate compute and network outages for these cloud providers here:

The API, parameterization, and implementation are as homogeneous as possible across the different cloud providers, so that consumers of these packages need only minimal effort to switch between them. However, if you are a Gardener user, please read on and use the Gardener-specific module instead, which makes it even easier and safer for you.

Kubernetes

Read more on how to disrupt pods here:

The module supports powerful filter criteria such as node labels, pod labels, pod metadata (e.g. kind or name), and pod owner references. However, if you are a Gardener user, please read on and use the Gardener-specific module instead, which makes it even easier and safer for you.
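To illustrate what combining such filter criteria means, here is a minimal sketch in plain Python. It is not the module's API: the function `select_pods`, its parameters, and the dictionary layout of a pod are assumptions made up for this illustration. It only shows that several criteria are combined conjunctively, i.e. a pod is selected if it matches all of the criteria that were given.

```python
from typing import Dict, List, Optional

# Hypothetical, simplified pod representation (not a Kubernetes client object):
# {"name": str, "labels": {str: str}, "owner_kind": str}
Pod = Dict[str, object]


def select_pods(
    pods: List[Pod],
    labels: Optional[Dict[str, str]] = None,
    name_prefix: Optional[str] = None,
    owner_kind: Optional[str] = None,
) -> List[Pod]:
    """Return the pods that satisfy every criterion that was provided."""
    selected = []
    for pod in pods:
        pod_labels = pod.get("labels", {})
        # All requested labels must be present with the requested value.
        if labels and any(pod_labels.get(k) != v for k, v in labels.items()):
            continue
        # The pod name must start with the requested prefix.
        if name_prefix and not str(pod.get("name", "")).startswith(name_prefix):
            continue
        # The owner reference kind (e.g. "ReplicaSet") must match.
        if owner_kind and pod.get("owner_kind") != owner_kind:
            continue
        selected.append(pod)
    return selected


pods = [
    {"name": "web-1", "labels": {"app": "web"}, "owner_kind": "ReplicaSet"},
    {"name": "web-2", "labels": {"app": "web"}, "owner_kind": "StatefulSet"},
    {"name": "db-1", "labels": {"app": "db"}, "owner_kind": "StatefulSet"},
]
targets = select_pods(pods, labels={"app": "web"}, owner_kind="ReplicaSet")
print([pod["name"] for pod in targets])
```

For the actual parameter names and semantics, please refer to the detailed module documentation.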

Gardener

Whether you want to target cloud provider resources or pods, if you have a Gardener-managed cluster, this package is for you: it supports all of the above in the simplest, most homogeneous, and most secure way (no need to specify cloud provider credentials, cluster credentials, or filters explicitly; it retrieves credentials and keeps them in memory only):

Human Interactions

Finally, there is a tiny additional module that is primarily useful for human invocation of chaostoolkit experiments (e.g. first assess the would-be impacted machines, wait for human user confirmation, then actually start the zone outage):
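The assess, confirm, then act flow described above can be sketched as follows. This is only an illustration under stated assumptions: `run_with_confirmation`, `assess`, `act`, and `ask` are hypothetical names invented here and are not part of this package.

```python
from typing import Callable, Iterable, List


def run_with_confirmation(
    assess: Callable[[], Iterable[str]],
    act: Callable[[List[str]], None],
    ask: Callable[[str], str] = input,
) -> bool:
    """Assess the impact, wait for human confirmation, then act.

    Returns True if the action was executed, False if the user declined.
    """
    # Step 1: determine what would be impacted, without changing anything.
    targets = list(assess())
    print(f"Would impact {len(targets)} machine(s):")
    for name in targets:
        print(f"  - {name}")

    # Step 2: wait for an explicit confirmation; anything but yes aborts.
    answer = ask("Proceed with the outage? [y/N] ").strip().lower()
    if answer not in ("y", "yes"):
        print("Aborted, nothing was changed.")
        return False

    # Step 3: only now run the disruptive action.
    act(targets)
    return True
```

The prompt function is injectable (it defaults to the built-in `input`), so the same flow can be exercised in automated runs without a human at the keyboard.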

Getting Started

If you are new to chaostoolkit and its terminology and tools, please check out our getting started tutorial. It will show you how to use this package in combination with the chaostoolkit CLI and experiment files.

If you would rather use the package directly in your chaos testing Python scripts, please check out our Python scripting tutorial.

If you are an experienced chaostoolkit user, please read on to pick up only the essentials.

Installation, Usage, and Configuration

This package was developed and tested with Python 3.9+ and is published on PyPI. You may want to create a virtual environment before installing it with pip:

pip install chaosgarden

If you want to use the VMware vSphere module, please note the remarks in requirements.txt for vSphere: those dependencies are not contained in the published PyPI package.

For usage and configuration of the individual modules, please see the detailed docs on the modules listed above.

This package is based on chaostoolkit and, to some degree, also on some of its incubation extensions (the requirements are included within the chaosgarden package). It can also be used directly from Python scripts and supports this mode with additional convenience helpers that launch actions and probes in the background, so that you can also compose complex scenarios with ease.
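As a rough idea of what running actions and probes in the background looks like, the following sketch uses only the Python standard library. It does not use the package's own convenience helpers: `launch_in_background`, `simulate_outage`, `probe_health`, and `run_scenario` are hypothetical stand-ins invented for this illustration.

```python
import threading
import time
from typing import Callable, List


def launch_in_background(func: Callable[..., None], *args, **kwargs) -> threading.Thread:
    """Start func in a daemon thread and return the thread handle."""
    thread = threading.Thread(target=func, args=args, kwargs=kwargs, daemon=True)
    thread.start()
    return thread


def simulate_outage(stop: threading.Event, log: List[str]) -> None:
    """Hypothetical action: hold an 'outage' until told to stop."""
    log.append("outage started")
    stop.wait(timeout=5.0)  # safety timeout so the action never runs forever
    log.append("outage ended")


def probe_health(stop: threading.Event, log: List[str], interval: float = 0.01) -> None:
    """Hypothetical probe: record a health check until told to stop."""
    while True:
        log.append("probe ok")
        if stop.wait(interval):  # returns True once the stop event is set
            break


def run_scenario(duration: float = 0.05) -> List[str]:
    """Run the action and the probe concurrently for the given duration."""
    log: List[str] = []
    stop = threading.Event()
    threads = [
        launch_in_background(simulate_outage, stop, log),
        launch_in_background(probe_health, stop, log),
    ]
    time.sleep(duration)  # the main script could do other work here
    stop.set()            # signal both background tasks to finish
    for thread in threads:
        thread.join(timeout=1.0)
    return log


print(run_scenario())
```

A shared `threading.Event` keeps the example small: one signal ends both the action and the probe, which is a common way to bound the duration of a composed scenario.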

If you intend to use it in combination with the chaostoolkit CLI and experiment files, you will have to install the CLI first and familiarize yourself with it.
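For orientation, a chaostoolkit experiment file is a JSON (or YAML) document whose method lists activities, each with a provider that points to a Python module and function. The sketch below builds such a document in Python. The module name `my_package.actions`, the function name `simulate_zone_outage`, and the arguments `zone` and `duration` are placeholders assumed for this illustration, not the actual names of this package; please take the real ones from the module docs.

```python
import json
from typing import Any, Dict


def build_experiment(module: str, func: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a minimal chaostoolkit experiment with a single Python action."""
    return {
        "title": "Simulate an outage",
        "description": "Minimal experiment with one action and no rollbacks.",
        "method": [
            {
                "type": "action",
                "name": func.replace("_", "-"),
                "provider": {
                    "type": "python",    # call a Python function
                    "module": module,    # placeholder module path
                    "func": func,        # placeholder function name
                    "arguments": arguments,
                },
            }
        ],
        "rollbacks": [],
    }


# Placeholder names and arguments, for illustration only.
experiment = build_experiment(
    "my_package.actions", "simulate_zone_outage", {"zone": 0, "duration": 60}
)
experiment_json = json.dumps(experiment, indent=2)
print(experiment_json)
```

Saved to a file such as `experiment.json`, a document of this shape is what you would pass to the CLI with `chaos run experiment.json`.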

Here are some links for further reading:

In some cases, we extended the original upstream open source incubator extensions significantly and we may eventually contribute those changes back upstream, if the community is interested.

Implementing High Availability and Tolerating Zone Outages

Developing a highly available workload that can tolerate a zone outage is no trivial task. You can find more information on how to achieve this goal here. While many of the recommendations are general, the examples show specifically how to achieve this in a Gardener-managed cluster and where and how to tweak the different control plane components. If you do not use Gardener, it may still be a worthwhile read.

Thank you for your interest in Gardener chaos engineering and making your workload more resilient.