Chaos
Chaos is a ladder.
We will explore basic fault and event injection to understand how a simple service responds to different events. This workshop is a bit more exploratory than some of our others, so it is also a good chance to practice using Docker.
Setup
Pull the image needed for this workshop.
bakerx pull chaos CSC-DevOps/Images#Spring2020
Clone this repository.
git clone https://github.com/CSC-DevOps/Chaos
Open a terminal inside the cloned Chaos directory and install the npm packages.
cd chap
npm install
# One more inner set of packages for dashboard.
cd dashboard
npm install
Inside the Chaos/chap directory, start the instances:
cd chap
node index.js up
After the instances initialize, start the dashboard service.
node index.js serve
You should see a dashboard available on your host machine at http://localhost:8080/.
Workshop
Control and canary servers
Your control server is running a service with several endpoints, including:
- http://192.168.44.102:3080/ --- displays a simple hello world message.
- http://192.168.44.102:3080/stackless --- computes a medium workload.
- http://192.168.44.102:3080/work --- computes a heavy workload.
Inside the server, three docker containers provide simple node express servers that service these requests. A reverse proxy uses a simple round-robin algorithm to load balance the requests between the containers.
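From inside either server (for example, after bakerx ssh greencanary), you can list these containers; the names and published ports match the docker run commands used later in this workshop to restart them:
# List the app containers sitting behind the reverse proxy.
docker ps --format "table {{.Names}}\t{{.Ports}}\t{{.Status}}"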
One more endpoint exists for the dashboard:
- http://192.168.44.102:3080/health --- returns basic metrics, including memory, cpu, and latency of services.
Finally, the canary server runs the exact same services, but at http://192.168.66.108:3080/
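You can also query the health endpoint directly to see the raw metrics the dashboard consumes (the payload format depends on the service implementation):
curl -s http://192.168.44.102:3080/health
curl -s http://192.168.66.108:3080/health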
Dashboard
The dashboard displays your running infrastructure. The container's overall health is visualized by a small colored square.
A line chart displays statistics gathered from the /health endpoint. A sidebar menu provides options for updating the line chart. Clicking the "CPU Load" menu item compares CPU load between the two servers. Clicking the "Latency" menu item shows latency, calculated as the average time to service a request between successive health-check calls.
You can visualize the effects of traffic on the metrics by visiting the servers, using a load script, a curl command, or siege. For example, you can see a brief spike on the control server by running:
siege -b -t30s http://192.168.44.102:3080/stackless
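If siege is not available on your host, a plain curl loop (single-threaded, so a gentler load) has a similar effect; this assumes a bash shell:
# Send sequential requests for roughly 30 seconds.
end=$((SECONDS + 30))
while [ "$SECONDS" -lt "$end" ]; do curl -s http://192.168.44.102:3080/stackless > /dev/null; done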
The dashboard fills up with data points if left running long enough. Reloading the page or switching metrics resets it.
Tools of Chaos
Inside the /chaos directory, you will find several scripts that cause general mayhem, including:
- Burning up the CPU.
- Filling up the disk.
- Dropping random packets.
- Killing processes/containers.
Many of the scripts use tc (traffic control), an advanced tool for setting network policy. For example, running the following will corrupt 50% of network packets.
tc qdisc add dev enp0s8 root netem corrupt 50%
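Other netem policies follow the same pattern (assuming the same enp0s8 interface). Note that only one root netem policy can be attached at a time, so delete the existing one before adding a different rule:
# Delay every packet by 200ms.
tc qdisc add dev enp0s8 root netem delay 200ms
# Drop 25% of packets at random.
tc qdisc add dev enp0s8 root netem loss 25%
# Remove the netem policy to restore normal networking.
tc qdisc del dev enp0s8 root netem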
Experimentation
While breaking things is fun, we usually want to do so in a principled manner, so we can learn how our infrastructure handles failure and discover unexpected results.
💥 Burning up the CPU of a single docker container
We will conduct a simple experiment where we induce a heavy CPU load on a container within the green canary server. We will then observe differences in metrics across the control and canary servers.
Adding chaos
Enter the green canary server (bakerx ssh greencanary). Then, let's open up a shell within one of the docker containers:
docker exec -it app1 sh
Then, let's add the cpu script, and execute it:
vi chaos_cpu.sh
chmod +x chaos_cpu.sh
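If you need something to paste into vi, a minimal CPU-burn loop looks roughly like this (a sketch, not necessarily the workshop's exact script):
#!/bin/sh
# chaos_cpu.sh (sketch): spawn one busy loop per available CPU.
# If nproc is unavailable in the container image, hard-code a count instead.
for i in $(seq "$(nproc)"); do
  ( while true; do :; done ) &
done
wait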
Run it.
./chaos_cpu.sh
Observations
Induce load on the green canary server.
siege -b -t30s http://192.168.66.108:3080/stackless
Now, induce load on the control server.
siege -b -t30s http://192.168.44.102:3080/stackless
You should see something like this:
What did we see? We see a large increase in latency on the green canary server, meaning client requests are taking much longer (and may be timing out).
What have we learned? Because our round-robin load balancer ignores CPU load, it continues to route requests to an overloaded instance, which takes much longer to service them. The implication is that avoiding routing to overloaded instances can improve quality of service.
🚦 Network traffic
Let's start another experiment, where we mess with the network traffic.
Inside our green canary server (bakerx ssh greencanary
), run the following:
/bakerx/chaos_network_corruption.sh
Alternatively, you can run the command directly:
tc qdisc add dev enp0s8 root netem corrupt 50%
Task: Try inducing load on the green canary server now. What does it look like? You might see something like this:
Once you are done, you can reset the connection with:
/bakerx/chaos_reset.sh
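If the reset script is not handy, you can also undo the corruption rule directly (this assumes the same enp0s8 interface used above):
tc qdisc del dev enp0s8 root netem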
😵 Killing and starting containers
Inside the green canary server, you can stop the running containers with docker stop.
Stop the app1 and app2 containers:
docker stop app1 app2
Conduct a brief experiment comparing the performance of the control server (with three containers) and the green canary server (with one container).
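One way to structure the comparison is to reuse the earlier siege commands against the heavier /work endpoint, watching the dashboard while each run is active:
# Control server: three containers behind the proxy.
siege -b -t30s http://192.168.44.102:3080/work
# Green canary server: only app3 is still running.
siege -b -t30s http://192.168.66.108:3080/work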
You can restore the containers again by starting them up with the following commands:
docker run --rm --name app1 -d -p 127.0.0.1:3005:3000/tcp app-server
docker run --rm --name app2 -d -p 127.0.0.1:3006:3000/tcp app-server
# docker run --rm --name app3 -d -p 127.0.0.1:3007:3000/tcp app-server
🔽 Squeeze Testing
By default, a Docker container runs with no CPU or memory limits. Try limiting the available CPU and memory when running the container. You can use these parameters:
--cpus=".5"
-m 8m
The memory limit takes a positive integer followed by a suffix of b, k, m, or g to indicate bytes, kilobytes, megabytes, or gigabytes; --cpus takes a fractional number of CPUs.
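For example, you could restart one of the canary's containers with tight limits and rerun the load (the values here are just illustrative):
# docker stop also removes the container (it was started with --rm), freeing the name.
docker stop app1
docker run --rm --name app1 -d -p 127.0.0.1:3005:3000/tcp --cpus=".5" -m 8m app-server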
Notice anything interesting?
⛽ Filling disks
- Inside one of the containers (e.g., using docker exec -it app3 sh), add and run the command to fill the disk: ./fill_disk.sh /fill 2G (a sketch of a possible script appears after this list).
- Try to create a file inside another container. What happened?
- Kill the container and start it again.
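The fill script itself can be very small. A sketch of one possible fill_disk.sh (the workshop's own script may differ):
#!/bin/sh
# fill_disk.sh (sketch) -- usage: ./fill_disk.sh <path> <size>, e.g. ./fill_disk.sh /fill 2G
# fallocate reserves the requested space in one shot; if the container image lacks
# fallocate, dd if=/dev/zero of="$1" bs=1M count=2048 fills roughly 2G instead.
fallocate -l "$2" "$1"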
Reflection
How could you extend this workshop to collect more measures and devise an automated experiment to understand which event/failure causes the most problems?
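One possible starting point is a small harness that samples the /health endpoint on both servers around each load run, once per fault you want to study. This is a sketch that assumes bash plus the endpoints and chaos scripts from this workshop; the inject and reset steps are left as placeholders for however you choose to run the scripts on the canary:
#!/bin/bash
# Sketch of an automated experiment harness: sample health metrics before and
# after a fixed load run, and save the results for later comparison.
CONTROL=http://192.168.44.102:3080
CANARY=http://192.168.66.108:3080

run_trial() {   # $1 = label for this trial (e.g., baseline, cpu-burn)
  mkdir -p results
  curl -s "$CONTROL/health" > "results/$1-control-before.txt"
  curl -s "$CANARY/health"  > "results/$1-canary-before.txt"
  siege -b -t30s "$CONTROL/stackless" > /dev/null 2>&1
  siege -b -t30s "$CANARY/stackless"  > /dev/null 2>&1
  curl -s "$CONTROL/health" > "results/$1-control-after.txt"
  curl -s "$CANARY/health"  > "results/$1-canary-after.txt"
}

run_trial baseline
# Inject a fault on the canary (e.g., run chaos_cpu.sh inside a container), then:
run_trial cpu-burn
# Reset the canary, inject the next fault (network corruption, killed containers,
# a full disk, ...), and call run_trial again with a new label.
Comparing the before and after samples across labels gives a rough ranking of which event or failure disturbs the service the most.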