This lab shows you how to run some basic chaos engineering experiments on Amazon Elastic Kubernetes Service (EKS).
The examples build on the existing chaostoolkit-demos repository, but use an EKS cluster rather than a self-hosted cluster.
You will need an AWS account with sufficient permissions to create an EKS cluster and related constructs.
First, clone the chaostoolkit-demos repository into a working directory.
export WORKDIR=<path to a new working directory>
mkdir -p $WORKDIR
cd $WORKDIR
git clone https://github.com/chaosiq/chaostoolkit-demos.git
Note that these instructions are based on that repository at commit 0db25e0.
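If you want the repository to match these instructions exactly, you can check out that commit:
git -C chaostoolkit-demos checkout 0db25e0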
Next, clone this repository:
git clone https://github.com/aws-samples/chaos-engineering-on-amazon-eks
The second repository has additional labs and supporting material. Let's copy its extra files into the first repository's directory.
cd chaostoolkit-demos
cp ../chaos-engineering-on-amazon-eks/eks.yaml .
cp -r ../chaos-engineering-on-amazon-eks/lab6 .
cp -r ../chaos-engineering-on-amazon-eks/lab7 .
cp -r ../chaos-engineering-on-amazon-eks/apps/cw apps/
cp ../chaos-engineering-on-amazon-eks/manifests/eks.yaml manifests
You will need to install eksctl
and make sure that your account has enough room under its VPC quota to create a new VPC. See the quick setup guide for more details.
Then run:
eksctl create cluster -f eks.yaml
This creates an EKS cluster with a managed node group, and configures kubectl
to use the new cluster as the default context.
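To confirm that the cluster is up and that kubectl is pointed at it, a quick check:
# Should print the new cluster's context and list the worker nodes in Ready state
kubectl config current-context
kubectl get nodes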
Next, install Prometheus using the kube-prometheus project. Change to a different working directory first.
cd $WORKDIR
git clone https://github.com/prometheus-operator/kube-prometheus
cd kube-prometheus
kubectl apply -f manifests/setup/
kubectl apply -f manifests/
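If the second apply fails because the custom resource definitions have not registered yet, wait for them and retry:
kubectl wait --for condition=Established --all CustomResourceDefinition
kubectl apply -f manifests/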
Check deployment status:
kubectl -n monitoring get all
Next, install Chaos Mesh:
cd $WORKDIR
curl -sSL https://mirrors.chaos-mesh.org/v1.0.2/install.sh | bash
Check deployment status:
kubectl -n chaos-testing get all
To reach the Chaos Mesh dashboard, forward the service port and browse to the overview page:
kubectl -n chaos-testing port-forward --address 0.0.0.0 service/chaos-dashboard 2333:2333
http://localhost:2333/dashboard/overview
Now deploy the Traefik ingress:
cd $WORKDIR
cd chaostoolkit-demos
kubectl apply -f manifests/traefik.yaml
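You can verify that the ingress service was created; its name is used later for port forwarding:
kubectl get service traefik-ingress-service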
Install the Chaos Toolkit extensions used by the labs:
pip install chaostoolkit-kubernetes chaostoolkit-prometheus chaostoolkit-addons jsonpath2
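If the chaos command is not already available (the extensions above do not necessarily install the CLI itself), install the core package and verify:
pip install chaostoolkit
chaos --version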
On Linux, go to a temporary directory and run:
cd /tmp
wget https://github.com/tsenart/vegeta/releases/download/v12.8.4/vegeta_12.8.4_linux_386.tar.gz
tar -zxf vegeta_12.8.4_linux_386.tar.gz
sudo cp ./vegeta /usr/local/bin/
sudo chmod +x /usr/local/bin/vegeta
On macOS, you can install vegeta with Homebrew instead:
brew update && brew install vegeta
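Either way, confirm the binary is on your PATH:
vegeta -version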
The default applications are already built. Just run:
cd $WORKDIR
cd chaostoolkit-demos
kubectl apply -f manifests/all.yaml
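You can confirm that the demo deployments and services are up before starting the experiments:
kubectl get deploy,svc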
First we need to set up port forwarding for the ingress.
kubectl port-forward --address 0.0.0.0 service/traefik-ingress-service 30080:80
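The port-forward command blocks, so run the remaining steps from a second terminal. As a quick sanity check, you can push a short burst of traffic through the ingress with vegeta (the target path here is an assumption; adjust it for the demo application):
# Send 5 seconds of requests through the forwarded ingress and summarize the results
echo "GET http://localhost:30080/" | vegeta attack -duration=5s | vegeta report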
Now run the experiment.
rm -f lab1/vegeta_results.json
export PYTHONPATH=`pwd`/ctkextensions
chaos run lab1/experiment.json
Let's scale the middle service to fix the problem.
kubectl scale --replicas=2 deployment/middle
rm -f lab1/vegeta_results.json
chaos run lab1/experiment.json
This lab has no single definitive resolution: the system could degrade by failing fast, or it could queue requests for later service.
chaos run --rollback-strategy=always lab2/experiment.json
Note: this experiment does not currently report an error as it should. An error is expected, because the back-end service is cut off from the network.
chaos run --rollback-strategy=always lab3/experiment.json
This experiment repeats Lab 3 but with Prometheus used for monitoring.
kubectl -n monitoring port-forward --address 0.0.0.0 service/prometheus-k8s 9090:9090
export PROMETHEUS_URL=http://localhost:9090
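With the port-forward still running in another terminal, you can confirm that Prometheus is reachable before starting (a simple query against its HTTP API; it should print "success"):
curl -s "$PROMETHEUS_URL/api/v1/query?query=up" | jq .status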
chaos run --rollback-strategy=always lab4/experiment.json
This experiment adds a safeguard so that the experiment cannot get out of control. First, deploy the deliberately failing application:
kubectl apply -f manifests/failingapp.yaml
This experiment terminates a random node from the worker node group.
First, get the Auto Scaling group name for the EKS node group from the EKS console.
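Alternatively, you can look it up from the CLI; this query assumes the cluster is named chaoscluster, as in eks.yaml:
# List ASGs tagged as belonging to the chaoscluster EKS cluster
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[?Tags[?Key=='eks:cluster-name' && Value=='chaoscluster']].AutoScalingGroupName" \
  --output text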
Next, check the list of existing instances:
aws ec2 describe-instances --filters Name=tag:eks:cluster-name,Values=chaoscluster | jq '.Reservations[].Instances[] | .InstanceId + " - " + .State.Name'
In lab6/experiment.json, edit the following fields:
aws_region
aws_profile_name
asg_names
Now run the experiment:
chaos run lab6/experiment.json
Confirm that one of the instances is stopped. It will be replaced automatically when the health check times out.
aws ec2 describe-instances --filters Name=tag:eks:cluster-name,Values=chaoscluster | jq '.Reservations[].Instances[] | .InstanceId + " - " + .State.Name'
This experiment takes away permission for a pod to send metrics to CloudWatch.
To start, we need to build a new Docker image. This simple application publishes metrics to CloudWatch.
cd apps/cw
./build_and_push.sh cwpod
Note the image URL in the output, and substitute it on line 24 of manifests/cw.yaml. Now apply the manifest:
cd $WORKDIR
cd chaostoolkit-demos
kubectl -n cw-metric-writer apply -f manifests/cw.yaml
Set up port forwarding for port 8000:
kubectl -n cw-metric-writer port-forward --address 0.0.0.0 service/cwpod 8000:8000
Go to http://localhost:8000 and refresh the page a few times. In CloudWatch, you should now see some metrics in the chaos namespace, under the app=cw dimension.
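You can also confirm the metrics from the CLI:
aws cloudwatch list-metrics --namespace chaos --dimensions Name=app,Value=cw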
Next, find the IAM role used by this pod's service account. The name will look something like this:
eksctl-chaoscluster-addon-iamserviceaccount-Role1-G9YIIYY2SK4H
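Rather than hunting through the IAM console, a filter like this should surface candidate roles (the name pattern is taken from the example above):
aws iam list-roles --query "Roles[?contains(RoleName, 'iamserviceaccount')].RoleName" --output text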
Enter the role name in the role_name field in lab7/experiment.json. Also edit these fields in the same file:
aws_region
aws_profile_name
Now run the experiment:
chaos run lab7/experiment.json
The experiment should fail, as the pod is no longer allowed to write to CloudWatch.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.