- Create a Kubernetes Cluster
- Deploy the Application
- Chaos Toolkit
- Install Chaos Toolkit
- Experiment: Kill a Dependent Service (Greeting)
- Experiment: Kill a Dependent Service (Name)
- Experiment: Send 10 concurrent requests with different Name
- Experiment: Kill Nodes in one AZ
- Experiment: Kill Nodes in two AZ
- Experiment: Kill Masters in one AZ
- Experiment: Kill All Masters
- Cleanup
- Istio
- Gremlin
This repo shows how to do Chaos Testing with applications deployed in Kubernetes.
- Install kops:

  ```sh
  brew update && brew install kops
  ```
- Create an S3 bucket and set up `KOPS_STATE_STORE`:

  ```sh
  aws s3 mb s3://kubernetes-aws-io
  export KOPS_STATE_STORE=s3://kubernetes-aws-io
  ```
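  Optionally, enable versioning on the state bucket, as the kops docs recommend, so that earlier cluster state can be recovered (the bucket name matches the one created above):

  ```sh
  aws s3api put-bucket-versioning \
    --bucket kubernetes-aws-io \
    --versioning-configuration Status=Enabled
  ```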
- Define an environment variable with the Availability Zones for the cluster:

  ```sh
  export AWS_AVAILABILITY_ZONES="$(aws ec2 describe-availability-zones --query 'AvailabilityZones[].ZoneName' --output text | awk -v OFS="," '$1=$1')"
  ```
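  The value should be a comma-separated list; in `us-east-1` it would look like the example below (the actual zones depend on your account):

  ```sh
  echo $AWS_AVAILABILITY_ZONES
  # us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f
  ```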
- Create a cluster with 3 masters and 6 workers:

  ```sh
  kops create cluster \
    --name=chaos.k8s.local \
    --master-count=3 \
    --node-count=6 \
    --zones=$AWS_AVAILABILITY_ZONES \
    --yes
  ```
- Verify the cluster and check that the masters are spread across multiple AZs:

  ```
  $ kops validate cluster
  Using cluster from kubectl context: chaos.k8s.local

  Validating cluster chaos.k8s.local

  INSTANCE GROUPS
  NAME               ROLE    MACHINETYPE  MIN  MAX  SUBNETS
  master-us-east-1a  Master  m3.medium    1    1    us-east-1a
  master-us-east-1b  Master  m3.medium    1    1    us-east-1b
  master-us-east-1c  Master  c4.large     1    1    us-east-1c
  nodes              Node    t2.medium    6    6    us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f

  NODE STATUS
  NAME                            ROLE    READY
  ip-172-20-102-194.ec2.internal  master  True
  ip-172-20-110-113.ec2.internal  node    True
  ip-172-20-148-247.ec2.internal  node    True
  ip-172-20-182-151.ec2.internal  node    True
  ip-172-20-216-171.ec2.internal  node    True
  ip-172-20-42-110.ec2.internal   node    True
  ip-172-20-54-242.ec2.internal   master  True
  ip-172-20-78-200.ec2.internal   master  True
  ip-172-20-82-162.ec2.internal   node    True

  Your cluster chaos.k8s.local is ready
  ```
- Check that the nodes are spread across multiple AZs:

  ```sh
  aws ec2 describe-instances --query 'Reservations[].Instances[].Placement.AvailabilityZone'
  ```
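  To tally the instances per zone at a glance (a small shell convenience, not one of the original steps):

  ```sh
  aws ec2 describe-instances \
    --query 'Reservations[].Instances[].Placement.AvailabilityZone' \
    --output text | tr '\t' '\n' | sort | uniq -c
  ```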
The application is defined at https://github.com/aws-samples/aws-microservices-deploy-options.
Note: Make sure to install the specified version of Helm. Helm 2.9.0 is not able to install charts; helm/helm#4001 has more details.
- Install the Helm CLI:

  ```sh
  brew install kubernetes-helm
  ```
- Switch to version 2.8.1:

  ```sh
  brew switch kubernetes-helm 2.8.1
  ```
- Install Helm in the Kubernetes cluster:

  ```sh
  helm init
  ```
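  Once the Tiller pod is up, `helm version` should report both the client and the server at the pinned version (the output lines in the comments are what Helm 2.x typically prints, sketched from memory):

  ```sh
  helm version
  # Client: &version.Version{SemVer:"v2.8.1", ...}
  # Server: &version.Version{SemVer:"v2.8.1", ...}
  ```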
- Create a Service Account and grant it permissions to deploy the application:

  ```sh
  $ kubectl create serviceaccount --namespace kube-system tiller
  serviceaccount "tiller" created
  $ kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
  clusterrolebinding "tiller-cluster-rule" created
  $ kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
  deployment "tiller-deploy" patched
  ```
- Install the Helm chart:

  ```sh
  helm install --name myapp charts/myapp
  ```
- Access the application:

  ```sh
  curl http://$(kubectl get svc/myapp-webapp -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
  ```
- Later, delete the Helm chart:

  ```sh
  helm delete --purge myapp
  ```
- Populate the `WEBAPP_URL` environment variable with the URL of your cluster's `webapp-service` endpoint:

  ```sh
  export WEBAPP_URL="http://$(kubectl get svc/myapp-webapp -o jsonpath={.status.loadBalancer.ingress[0].hostname})/"
  ```
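  As a quick sanity check of the steady state (a convenience step, not part of the original instructions), curl's standard `-w` format variables can print the status code and total response time:

  ```sh
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" "$WEBAPP_URL"
  ```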
The output of the `chaos run` command shows that the experiment was run, but that there is a weakness in the system: when the `myapp-greeting` service is killed, the `myapp-webapp` endpoint takes longer than the allowed 3 seconds to respond. This is outside the tolerance within which the system is observed to still be in its steady state.

TODO: How do I start the service back up during rollback?
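For reference, here is a minimal sketch of what such an experiment definition could look like. This is an illustration, not the exact file used in this repo: the probe URL, label selector, and namespace are assumptions based on the service names above. The steady-state hypothesis encodes the 3-second tolerance, and the method kills the `myapp-greeting` pods with the `chaostoolkit-kubernetes` extension. Writing it as a heredoc keeps everything in the shell:

```sh
# Sketch of an experiment file; the selector and namespace are assumptions.
cat > kill-greeting.json <<'EOF'
{
    "version": "1.0.0",
    "title": "Kill the greeting service",
    "description": "The webapp endpoint must respond within 3 seconds",
    "configuration": {
        "webapp_url": {"type": "env", "key": "WEBAPP_URL"}
    },
    "steady-state-hypothesis": {
        "title": "Webapp responds within tolerance",
        "probes": [
            {
                "type": "probe",
                "name": "webapp-responds",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "${webapp_url}",
                    "timeout": 3
                }
            }
        ]
    },
    "method": [
        {
            "type": "action",
            "name": "kill-greeting-pods",
            "provider": {
                "type": "python",
                "module": "chaosk8s.pod.actions",
                "func": "terminate_pods",
                "arguments": {
                    "label_selector": "app=myapp-greeting",
                    "ns": "default"
                }
            }
        }
    ]
}
EOF
chaos run kill-greeting.json
```

Chaos Toolkit also accepts a top-level `rollbacks` list of actions that run after the method; with a Kubernetes Deployment the terminated pods are normally rescheduled automatically, which may be why no rollback is defined here.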
- Install Istio (complete details at the Istio quick start):

  ```sh
  curl -L https://git.io/getLatestIstio | sh -
  cd istio-*
  export PATH=$PWD/bin:$PATH
  kubectl apply -f install/kubernetes/istio.yaml
  ```
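  To confirm the control plane came up, list the pods in the `istio-system` namespace (the exact component names vary by Istio release):

  ```sh
  kubectl get pods -n istio-system
  ```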
- Inject the Envoy proxy as a sidecar in each pod:

  TODO: How to inject Istio for an application deployed using a Helm chart? One possible approach is sketched below.
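  One approach that is commonly used (a sketch, not verified against this chart): render the chart with `helm template`, pipe the manifests through `istioctl kube-inject`, and apply the result. The chart path and release name match the ones used earlier.

  ```sh
  helm template charts/myapp --name myapp \
    | istioctl kube-inject -f - \
    | kubectl apply -f -
  ```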