Zebrium Kubernetes Demo Environment

The purpose of this repository is to build a realistic app environment running multiple services on a Kubernetes cluster, then run a series of chaos experiments to see whether an Autonomous Monitoring solution can automatically detect the resulting incidents without any pre-configuration.

It makes it easy to spin up a fully deployed GKE cluster running the Sock Shop microservice application, Kafka, and the Litmus Chaos Engine, which together create incident scenarios for testing Zebrium's Autonomous Monitoring tool.

For more background, please read Using Autonomous Monitoring with Litmus Chaos Engine on Kubernetes.

After cloning this repository and installing the requirements listed below, use the start command to create the fully deployed cluster. You can then run Litmus Chaos experiments in the cluster with the test command and get more insight into the failures from Zebrium's unsupervised Machine Learning, which detects the incidents created by the Litmus experiments and their root cause.

This repository is also a reference for configuring and running Litmus Chaos Engine experiments. You can find all the experiment configuration under the /litmus directory of this repository, and the script to deploy and run the experiments in manage.py.

It currently works only with GKE, so you will need a Google Cloud account to run this environment. Support for Amazon and Azure is planned.

Requirements

Python 3.7 or above
Python Dependencies: pip install -r requirements.txt
Free Zebrium account to collect logs: https://www.zebrium.com
Google Cloud Login: https://console.cloud.google.com/
GCloud CLI installed locally and logged in: https://cloud.google.com/sdk/docs/quickstarts
Kubectl installed locally: https://kubernetes.io/docs/tasks/tools/install-kubectl/
Helm installed locally: https://helm.sh/docs/intro/install/
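
Before starting the cluster, it can help to confirm that the tools listed above are actually available. The following is a minimal sketch, not part of this repository: the helper names missing_tools and python_is_supported are hypothetical, and it only checks that the gcloud, kubectl and helm executables are on your PATH and that the interpreter is Python 3.7 or above. It does not verify that you are logged in to Google Cloud.

```python
import shutil
import sys

# Command-line tools named in the requirements list (assumed executable names).
REQUIRED_TOOLS = ("gcloud", "kubectl", "helm")
MIN_PYTHON = (3, 7)


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_is_supported(version=None):
    """Return True when the (major, minor) version is 3.7 or above."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= MIN_PYTHON


def report():
    """Print one line per problem found and return the number of problems."""
    problems = [f"{tool} not found on PATH" for tool in missing_tools()]
    if not python_is_supported():
        problems.append("Python 3.7 or above is required")
    for problem in problems:
        print(problem)
    return len(problems)
```

Calling report() prints nothing and returns 0 when every check passes.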

IMPORTANT: Before running the chaos experiments, set the Refractory Period in your Advanced Account Settings to 10 minutes. The experiments run in quick succession, which is not how real-world incidents occur, and this setting stops multiple experiments from being grouped into one incident in Zebrium. You can adjust it at https://portal03.zebrium.com/Settings/advanced as soon as the cluster has been started and some data has been ingested.

Usage

To see full command line options use the -h flag:

./manage.py -h

This will output the following:

usage: manage.py [-h] {start,test,list,stop} ...

Spin up Zebrium Demo Environment on Kubernetes.

positional arguments:
  {start,test,list,stop}
    start               Start a GKE Cluster with Zebrium's demo environment
                        deployed.
    test                Run Litmus ChaosEngine Experiments inside Zebrium's
                        demo environment.
    list                List all available Litmus ChaosEngine Experiments
                        available to run.
    stop                Shutdown the GKE Cluster with Zebrium's demo
                        environment deployed.
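
For readers who want to see how a command line like this can be put together, here is a minimal argparse sketch that mirrors the help text above. It is an illustration only and may differ from what manage.py actually does: the build_parser name, the dest="command" attribute, the decision to mark --project and --key as required, and the integer type of --wait are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser with the start, test, list and stop subcommands."""
    parser = argparse.ArgumentParser(
        prog="manage.py",
        description="Spin up Zebrium Demo Environment on Kubernetes.",
    )
    subparsers = parser.add_subparsers(dest="command")

    start = subparsers.add_parser(
        "start",
        help="Start a GKE Cluster with Zebrium's demo environment deployed.",
    )
    # Assumption: both values are mandatory for start.
    start.add_argument("--project", required=True)
    start.add_argument("--key", required=True)

    test = subparsers.add_parser(
        "test",
        help="Run Litmus ChaosEngine Experiments inside Zebrium's demo environment.",
    )
    # No --test value means "run every experiment".
    test.add_argument("--test", default=None)
    # Minutes to wait between experiments; 20 is the documented default.
    test.add_argument("--wait", type=int, default=20)

    subparsers.add_parser(
        "list",
        help="List all available Litmus ChaosEngine Experiments available to run.",
    )

    stop = subparsers.add_parser(
        "stop",
        help="Shutdown the GKE Cluster with Zebrium's demo environment deployed.",
    )
    stop.add_argument("--project", required=True)
    return parser
```

With this sketch, build_parser().parse_args(["test", "--test=container-kill"]) yields a namespace whose command is "test" and whose test is "container-kill".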

Startup

To start the GKE cluster and deploy all the required components:

./manage.py start --project {GC_PROJECT} --key {ZE_KEY}

Test

To run all the Litmus ChaosEngine experiments:

./manage.py test

You can optionally add the --wait= argument to change the wait time between experiments, in minutes. The default is 20 minutes, which ensures Zebrium doesn't cluster separate experiments into a single incident.
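
The spacing between experiments can be pictured with the short sketch below. It is a simplified illustration under stated assumptions rather than the code in manage.py: run_experiments and run_one are hypothetical names, and the sleep function is passed in so the loop can be exercised without actually waiting.

```python
import time

# Documented default gap between experiments, in minutes.
DEFAULT_WAIT_MINUTES = 20


def run_experiments(experiments, run_one,
                    wait_minutes=DEFAULT_WAIT_MINUTES, sleep=time.sleep):
    """Run each experiment in order, pausing between consecutive runs.

    `run_one` is a callable that launches a single experiment by name.
    No pause is inserted before the first run or after the last one.
    """
    for index, name in enumerate(experiments):
        if index > 0:
            # Convert minutes to the seconds that time.sleep expects.
            sleep(wait_minutes * 60)
        run_one(name)
```

Under this model, three experiments with the default wait take at least 40 minutes of idle time in total, since there are two gaps of 20 minutes each.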

To run a specific experiment (found under the ./litmus directory):

./manage.py test --test=container-kill

Available Experiments

container-kill: https://docs.litmuschaos.io/docs/container-kill
disk-fill: https://docs.litmuschaos.io/docs/disk-fill
kafka-broker-pod-failure: https://docs.litmuschaos.io/docs/kafka-broker-pod-failure/
pod-delete: https://docs.litmuschaos.io/docs/pod-delete
pod-network-corruption: https://docs.litmuschaos.io/docs/pod-network-corruption

Notes

To see which application deployment was picked and whether reconcile operations succeeded or failed (i.e., whether the chaos-runner pod was created), check the chaos operator logs. Ex:

kubectl logs -f chaos-operator-ce-6899bbdb9-jz6jv -n litmus
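
The pod name in the command above ends in generated suffixes, so it will differ in your cluster. One way around this is to ask kubectl for the logs of the owning Deployment instead. The sketch below assumes the operator pod is managed by a Deployment, in which case pod names follow the pattern deployment-replicasethash-podsuffix; the helper names are hypothetical, and the deployment name chaos-operator-ce is inferred from the example pod name rather than confirmed by this repository.

```python
def deployment_from_pod(pod_name):
    """Strip the ReplicaSet hash and pod suffix from a Deployment pod name."""
    return pod_name.rsplit("-", 2)[0]


def operator_logs_command(pod_name, namespace="litmus"):
    """Build a kubectl command that follows the logs of the pod's Deployment."""
    deployment = deployment_from_pod(pod_name)
    return ["kubectl", "logs", "-f", f"deployment/{deployment}", "-n", namespace]
```

The returned list can be passed directly to subprocess.run. For pods that are not created by a Deployment, such as sock-chaos-runner, this name parsing does not apply.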

To view the parameters with which the experiment job is created, the status of the experiment, the success of the chaosengine patch operation, and the cleanup of the experiment pod, check the logs of the chaos-runner pod. Ex:

kubectl logs sock-chaos-runner -n sock-shop

To view the logs of the chaos experiment itself, set .spec.jobCleanupPolicy of the chaosengine CR to retain so that the experiment pod is kept after it finishes. Ex:

kubectl logs container-kill-1oo8wv-85lsl -n sock-shop

To re-run the chaos experiment, clean up and re-create the chaosengine CR:

kubectl delete chaosengine sock-chaos -n sock-shop
kubectl apply -f litmus/chaosengine.yaml
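
If you re-run experiments often, the two commands above can be wrapped in a small helper. This is a minimal sketch and not part of manage.py: the function names are hypothetical, the defaults are copied from the commands above, and it assumes kubectl is on your PATH and already pointed at the demo cluster.

```python
import subprocess


def rerun_commands(engine="sock-chaos", namespace="sock-shop",
                   manifest="litmus/chaosengine.yaml"):
    """Return the kubectl commands that delete and re-create the chaosengine."""
    return [
        ["kubectl", "delete", "chaosengine", engine, "-n", namespace],
        ["kubectl", "apply", "-f", manifest],
    ]


def rerun(run=subprocess.run, **kwargs):
    """Execute the commands in order, stopping at the first failure."""
    for command in rerun_commands(**kwargs):
        # check=True raises CalledProcessError on a non-zero exit code.
        run(command, check=True)
```

Calling rerun() with no arguments performs the same delete and apply steps shown above.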

List

To list all the available Litmus Chaos experiments in this repo (found under the ./litmus directory):

./manage.py list

Shutdown

To shut down and destroy the GKE cluster when you're finished:

./manage.py stop --project {GC_PROJECT}
