Even before we get into the definition of Chaos Engineering or why it has become important, let's take a look at traditional approach. Most of the applications and configuration would be put under stress testing to find out the breakage point. This primarily helped to assure the operations team that the provisioned capacity is enough for the anticipated workload. The tests was relatively (if not fairly) simple to do. But with time there are couple of things that has changed:
- System have become more and more complex now
- Workloads can change abruptly and scaling up and down is a necessity now
Also, there is a philosophical shift happenning the way IT operations used to think -
-
Servers are disposable - Earlier the basic deployment units (in most cases physical or virtual servers) were treated like "Pets" and the configuration changes would lead to a snowflake. Now with configuration management tools servers are disposable like "cattles" and can be resurrected from scratch if there is a configuration change aka Pheonix Servers.
-
Failure have been accepted as business as usual, outages are not. I am not trying to force you to accept system failures, but most of the IT operations today acknowledges that things would go wrong. Simply put, one needs to be prepared for it.
-
Because of the explosion of internet, services are not limited by geographies any more. Workloads are not predictible any more and they are bound to go beyond the breakage point of one servers, it is just a matter of time and chance.
-
Complexity of applications has increased multi-fold. Today applications are not just three tier deployments. A web page rednered might be working with 10s or in some cases 100s of micro-services in the backend. Only way test the resiliency of the system is by injecting random issues on purpose.
This all lead the IT Operation leads to be convinced that the best way to be prepared for an outage is to simulate one. If you are not convinced yet, perhaps you want to read a bit about the study of how much loss the business can suffer because of infrastructure outage.
So what should be your strategy? I believe the easiest way is to introduce unit testing and integration testing for infrastructure and architecture components too, just like application code. so for any kind of High Availability or Disaster Recovery approach you have implemented, you should have a test case. e.g. if you are having a cluster with 2 nodes, your test case could be be shoot down one of the node. Yes, you read it right. I am suggesting that you should take down a node. There is no other way for you to test high availability but to simulate failure. Similarly you can test scalability but injecting slowness and network congetion.
There are many popular examples and inspirations for Chaos Injection. Most popular one are:
- Generic guidelines are available on Principles of Chaos Engineering
- Netflix's Chaos Monkey to do various kind of chaos injection e.g. introduce slowness in the network, kill EC2 instances, detach the network or disks from EC2 instances
- Netflix's Chaos Kong though is not open sourced yet but a nice inspiration and aspiration for anyone embarking on chaos engineering within their enterprise.
- Facebook's Project Storm
Those who practice chaos engineering by trying to break themselves, have been rewarded well in times of outages. Best example is how Netflix weathered the storm by preparing for the worst.
In today's date a lot of new applications and services are being deployed as containers. If you are starting up with Chaos Engineering in Docker, there are many different mechanisms and tools available at your disposal.
Before we get into tools, let's look at some of the basic features of Docker which should be helpful to you.
It is often better to deploy your application as a Swarm Service instead of deploying them as native container. In case you are using Kubernetes, it is better to deploy your request as a sevice. Both the definitions are declarative and define the desired state of service. This is really helpful in maintaining the uptime of your application as the service would always try to maintain the availability of service.
In this example, I am going to use a Dockerfile to build a new image and then I will be using it to deploy a new service. The example is executed against a Docker UCP cluster from a client node (with docker cli and UCP Client Bundle).
Setup a docker build file Dockerfile-nohc
:
FROM nginx:latest
RUN apt-get -qq update
COPY index.html /usr/share/nginx/html
EXPOSE 80 443
CMD ["nginx", "-g", "daemon off;"]
Build your image Image
sh-4.2$ docker image build -t $dtr_url/development/tweet_to_us:demoMay -f Dockerfile-nohc .
Sending build context to Docker daemon 4.096kB
ip-10-100-2-106: Step 1/4 : FROM nginx:latest
ip-10-100-2-106: ---> ae513a47849c
ip-10-100-2-106: Step 2/4 : COPY index.html /usr/share/nginx/html
ip-10-100-2-106: ---> Using cache
ip-10-100-2-106: ---> b97207424f3a
ip-10-100-2-106: Step 3/4 : EXPOSE 80 443
ip-10-100-2-106: ---> Using cache
ip-10-100-2-106: ---> bfe4f59a2094
ip-10-100-2-106: Step 4/4 : CMD nginx -g daemon off;
ip-10-100-2-106: ---> Using cache
ip-10-100-2-106: ---> cb79c6283bb5
ip-10-100-2-106: Successfully built cb79c6283bb5
ip-10-100-2-106: Successfully tagged dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay
Now we need to push you image to a repository (DTR or Dockerhub), so that it is available to all nodes:
sh-4.2$ docker image push $dtr_url/development/tweet_to_us:demoMay
The push refers to a repository [dtr.ashnikdemo.com:12443/development/tweet_to_us]
c75bed55c5fa: Pushed
7ab428981537: Mounted from development/tweet-to-us
82b81d779f83: Mounted from development/tweet-to-us
d626a8ad97a1: Mounted from development/tweet-to-us
demoMay: digest: sha256:08090c853df56ceee495fb95537ac9f2c81cf8718e5fc76c513ba1d8e7d145f0 size: 1155
Now we will start a service using this image:
sh-4.2$ docker service create -d --name=twet-app --mode=replicated --replicas=2 --publish 8080:80 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay
pq6eojqprru4ctw0ib0lwfmj6
This request asks the Swarm cluster to setup the service with --mode=replicated
and --replicas=2
i.e. Swarm would try to maintain two tasks for this service at any point of time, unless requested otherwise by the user. You can inspect the tasks running for the service with docker service ps
command:
sh-4.2$ docker service ps twet-app
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
zzq1jgolcc2o twet-app.1 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay ip-10-100-2-67 Running Running 3 minutes ago
zlkf4ejuxus8 twet-app.2 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay ip-10-100-2-93 Running Running 3 minutes ago
As you can see there are two tasks running and these tasks would be setup with VIP which will do load-balancing among the two containers/tasks.
sh-4.2$ docker service inspect --format='{{.Endpoint}}' twet-app
{{vip [{ tcp 80 8080 ingress}]} [{ tcp 80 8080 ingress}] [{f80zlxoy56y20ql48o3v9aiwo 10.255.0.225/16}]}
Let's try to kill one of the underlying containers and see if Swarm is able to maintain the declarative state we had requested:
sh-4.2$ docker container ls | grep -i twet-app
603c7f8940fe dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay "nginx -g 'daemon ..." 7 minutes ago Up 7 minutes 80/tcp, 443/tcp ip-10-100-2-67/twet-app.1.zzq1jgolcc2oyucexn4j9u9pq
54aa164ea509 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay "nginx -g 'daemon ..." 7 minutes ago Up 7 minutes 80/tcp, 443/tcp ip-10-100-2-93/twet-app.2.zlkf4ejuxus851onp4i2t143p
sh-4.2$
sh-4.2$ docker container kill 603c7f8940fe
603c7f8940fe
sh-4.2$
sh-4.2$
sh-4.2$ docker service ps twet-app
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
sp4hz64oytu0 twet-app.1 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay ip-10-100-2-67 Running Running 2 seconds ago
zzq1jgolcc2o \_ twet-app.1 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay ip-10-100-2-67 Shutdown Failed 7 seconds ago "task: non-zero exit (137)"
zlkf4ejuxus8 twet-app.2 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay ip-10-100-2-93 Running Running 8 minutes ago
sh-4.2$
As you can see the container 603c7f8940fe
was used by one of the tasks of our service twet-app
and once we kill the container, Swarm tries to maintain the state by starting another task.
Note: Pushing image to repository is needed when you are running with distributed setup. As you can see above in the build was done on one of the nodes from the Swarm clusterip-10-100-2-106
and image would be only available on only one node. Hence if we were to run service without pushing the image to a repository, there is good chance that the tasks would get started on the same node (ip-10-100-2-106
) i.e. the only node that has access to the image or different nodes would get different images (left by different image builds). Swarm does a good job of reminding us about this. Here is an example if I tried to run the servie without pushing the image:
sh-4.2$ docker service create -d --name=twet-app --mode=replicated --replicas=2 --publish 8080:80 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay
image dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay could not be accessed on a registry to record
its digest. Each node will access dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay independently,
possibly leading to different nodes running different
versions of the image.
t46gb1wi3tc7xs2j08egzcut1
Docker allows you to use healthcheck to keep a tab on the health of running containers. The healthcheck can be either baked into you image during the build process using HEALTHCHECK
direction in Dockerfile or during runtime using --healthcheck option with docker service create
or docker container run
To quote the docker documentation
The
HEALTHCHECK
instruction tells Docker how to test a container to check that it is still working. This can detect cases such as a web server that is stuck in an infinite loop and unable to handle new connections, even though the server process is still running.
Note: The HEALTHCHECK
feature was added in Docker 1.12.
To make use of this feature we will add a new command to our Dockerfile
now
HEALTHCHECK --interval=30s --timeout=3s --retries=2 \
CMD python /usr/share/nginx/html/healthcheck.py || exit 1
This means that the healthcheck command python /usr/share/nginx/html/healthcheck.py
will be run for the first time after 30s
i.e. 30 seconds after starting up the tasks. The healthcheck will be run with an interval
of every 30s after that. The healthcheck would timeout
in 3s
and upon failure of 2 retries
the container will be declared unhealthy.
We will have to add a few new files to support HEALTHCHECK
healthcheck.py
- our own little piece of code to check the health of container.healthcheck.html
Now we will build and push the image
sh-4.2# docker image build --no-cache -t $dtr_url/development/tweet_to_us:demoMay_Healthcheck -f Dockerfile .
Sending build context to Docker daemon 7.68kB
Step 1/7 : FROM nginx:latest
---> b175e7467d66
Step 2/7 : RUN apt-get -qq update
---> Running in 152a3156632c
---> 2a6be94d9a04
Removing intermediate container 152a3156632c
Step 3/7 : RUN apt-get -qq --allow-downgrades --allow-remove-essential --allow-change-held-packages install python > /dev/null
---> Running in 56a5b9141aaf
debconf: delaying package configuration, since apt-utils is not installed
---> 99605506e79f
Removing intermediate container 56a5b9141aaf
Step 4/7 : COPY healthcheck.html healthcheck.py index.html /usr/share/nginx/html/
---> b1b93b73d0fa
Removing intermediate container dab1d03a75e4
Step 5/7 : EXPOSE 80 443
---> Running in 50b63022a6c3
---> 4297f32f769b
Removing intermediate container 50b63022a6c3
Step 6/7 : HEALTHCHECK --interval=30s --timeout=3s --retries=2 CMD python /usr/share/nginx/html/healthcheck.py || exit 1
---> Running in 1a1a9cd1f139
---> 042010177008
Removing intermediate container 1a1a9cd1f139
Step 7/7 : CMD nginx -g daemon off;
---> Running in 767a8098f177
---> 9153fcd78222
Removing intermediate container 767a8098f177
Successfully built 9153fcd78222
Successfully tagged dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
sh-4.2# docker image push dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
The push refers to a repository [dtr.ashnikdemo.com:12443/development/tweet_to_us]
a60e00b623bb: Pushed
2e83bcd5bc8d: Pushed
5134599f00a1: Pushed
77e23640b533: Pushed
757d7bb101da: Pushed
3358360aedad: Pushed
demoMay_Healthcheck: digest: sha256:a4fb4fd2733e37ae7282148ccb497aac4c2fc18a74aa8e950271ffd648b07da8 size: 1579
Now once we deploy the service, initially the health status would be starting
until the first healthcheck is initiated
sh-4.2$ docker service rm twet-app
twet-app
sh-4.2$ docker service create -d --name=twet-app --mode=replicated --replicas=2 --publish 8080:80 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
lbfmu7vxa1i6arfmpstzq3rer
sh-4.2$ docker container ls | grep -i twet
1feb5ed8e0b6 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 23 seconds ago Up 23 seconds (health: starting) 80/tcp, 443/tcp ip-10-100-2-93/twet-app.2.rkh6gofzfru83wjqcyzq2mdcl
6ccb4d691fe9 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 24 seconds ago Up 23 seconds (health: starting) 80/tcp, 443/tcp ip-10-100-2-67/twet-app.1.urb4v2vttrlsvcz11wfnj6yh2
After the first healthechk, the healthcheck status would be healthy
sh-4.2$ docker container ls | grep -i twet
1feb5ed8e0b6 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." About a minute ago Up About a minute (healthy) 80/tcp, 443/tcp ip-10-100-2-93/twet-app.2.rkh6gofzfru83wjqcyzq2mdcl
6ccb4d691fe9 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." About a minute ago Up About a minute (healthy) 80/tcp, 443/tcp ip-10-100-2-67/twet-app.1.urb4v2vttrlsvcz11wfnj6yh2
Now let's try to force a distruption by connecting to one of the containers and changing the content of healthcheck.html
sh-4.2$ docker container exec -it 1feb5ed8e0b6 bash
root@1feb5ed8e0b6:/# echo test > /usr/share/nginx/html/healthcheck.html
root@1feb5ed8e0b6:/# exit
Soon (in about 1 minute given our interval
, timeout
and retries
configuration in the Dockerfile), the container will be reported unhealthy and replaced with a new container to run the task
sh-4.2$ docker container ls | grep -i twet
1feb5ed8e0b6 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 3 minutes ago Up 13 minutes (unhealthy) 80/tcp, 443/tcp ip-10-100-2-93/twet-app.2.xskxfd9n6e39wlghpm0k7tphr
6ccb4d691fe9 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 13 minutes ago Up 13 minutes (healthy) 80/tcp, 443/tcp ip-10-100-2-67/twet-app.1.urb4v2vttrlsvcz11wfnj6yh2
sh-4.2$ docker container ls -a | grep -i twet
efc5b969ef8c dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 42 seconds ago Up 36 seconds (healthy) 80/tcp, 443/tcp ip-10-100-2-93/twet-app.2.xskxfd9n6e39wlghpm0k7tphr
1feb5ed8e0b6 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 11 minutes ago Exited (0) 41 seconds ago ip-10-100-2-93/twet-app.2.rkh6gofzfru83wjqcyzq2mdcl
6ccb4d691fe9 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 11 minutes ago Up 11 minutes (healthy) 80/tcp, 443/tcp ip-10-100-2-67/twet-app.1.urb4v2vttrlsvcz11wfnj6yh2
You can also override the command to check health, its frequency and retries while creating the service
sh-4.2$ docker service rm twet-app
twet-app
sh-4.2$ docker service create -d --name=twet-app \
--mode=replicated --replicas=2 --publish 8080:80 \
--health-cmd "python /usr/share/nginx/html/healthcheck.py || exit 1" \
--health-interval 10s \
--health-retries 2 \
--health-timeout 30ms \
dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
mj4lel34whrvupscq8sjt7g5m
In runtime while creating a service, you can disable the healtcheck with --no-healthcheck
option. That will supress any healthcheck which has been defined in the base image
sh-4.2$ docker service create -d --name=twet-app \
--mode=replicated --replicas=2 --publish 8080:80 \
--no-healthcheck \
dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
nxkv39smzq5k0o9tgwmc74t2c
If the base container you are going to use has a HEALTHCHECK
defined, it can also disable the healthchek during build time using HEALTHCHECK NONE
You can use docker container inspect
command to further review the state of your containers and details healthchekc command output:
e.g. in case of timeout error:
sh-4.2$ docker container inspect --format='{{json .State.Health}}' a9486dc964af
{"Status":"starting","FailingStreak":1,"Log":[{"Start":"2018-05-12T17:18:01.698087531Z","End":"2018-05-12T17:18:01.728282187Z","ExitCode":-1,"Output":"Health check exceeded timeout (30ms)"}]}
in case of failures:
sh-4.2$ docker container inspect --format='{{json .State.Health}}' 62ad34709fb8
{"Status":"healthy","FailingStreak":1,"Log":[{"Start":"2018-05-12T17:28:33.714393794Z","End":"2018-05-12T17:28:33.793534206Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:28:53.793900452Z","End":"2018-05-12T17:28:53.871217425Z","ExitCode":1,"Output":"The content of the healthcheck did not match. Expected Content-\"healthy\", we got: test\n"}]}
sh-4.2$ docker container inspect --format='{{json .State.Health}}' 62ad34709fb8
{"Status":"unhealthy","FailingStreak":2,"Log":[{"Start":"2018-05-12T17:28:33.714393794Z","End":"2018-05-12T17:28:33.793534206Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:28:53.793900452Z","End":"2018-05-12T17:28:53.871217425Z","ExitCode":1,"Output":"The content of the healthcheck did not match. Expected Content-\"healthy\", we got: test\n"},{"Start":"2018-05-12T17:29:13.871399894Z","End":"2018-05-12T17:29:13.948097443Z","ExitCode":1,"Output":"The content of the healthcheck did not match. Expected Content-\"healthy\", we got: test\n"}]}
in case of no failures
sh-4.2$ docker container inspect --format='{{json .State.Health}}' 181d566f6aa9
{"Status":"healthy","FailingStreak":0,"Log":[{"Start":"2018-05-12T17:25:06.184822447Z","End":"2018-05-12T17:25:06.262241844Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:25:26.262408086Z","End":"2018-05-12T17:25:26.338883823Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:25:46.339143953Z","End":"2018-05-12T17:25:46.416973058Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:26:06.417170336Z","End":"2018-05-12T17:26:06.495295881Z","ExitCode":0,"Output":""},{"Start":"2018-05-12T17:26:26.495482044Z","End":"2018-05-12T17:26:26.572278146Z","ExitCode":0,"Output":""}]}
Note: The output will contain a friendly message if one is printed by your healthcheck command.
Now that we have covered the basic building blocks of chaos engineering with Docker, let's try to take a look at some tools. Pumba is a fairly new but quite promising tool for chaos orchestration. Best thing is it works well with a Swarm cluster, you just need to point it to the manager node. We can easily get it to work with Docker UCP Client Bundle.
First we need to setup an isolated network where we will setup our application and test it out docker network create -d overlay tweet-app-net
Now let's setup a service using healthcheck from the previous examples
docker service create -d --name=twet-app --network tweet-app-net \
--mode=replicated --replicas=2 --publish 8080:80 \
--health-cmd "python /usr/share/nginx/html/healthcheck.py || exit 1" \
--health-interval 20s \
--health-retries 2 \
--health-timeout 200ms \
dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
Let's ensure that the service has been started properly with requested number of replicas which are healthy
sh-4.2$ docker container ls | grep -i twet
75b2bf6f219d dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 27 seconds ago Up 21 seconds (healthy) 80/tcp, 443/tcp ip-10-100-2-93/twet-app.1.im7f7qm2xh6fk6uqla462qzia
393355d083fb dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." About a minute ago Up 59 seconds (healthy) 80/tcp, 443/tcp ip-10-100-2-67/twet-app.2.6uueh28nxj7btpfzffeq40f6b
Now let's use pumba to randomly kill some containers under the service
export SVC_NAME=twet-app
pumba --random kill $(docker service ps --no-trunc \
--filter "desired-state=Running" \
${SVC_NAME} | awk ' {if (NR!=1) {print $2"."$1} } ')
You will an output confirming that the container has been killed
sh-4.2$ pumba --random kill $(docker service ps --no-trunc \
--filter "desired-state=Running" \
${SVC_NAME} | awk ' {if (NR!=1) {print $2"."$1} } ')
INFO[0000] Kill containers
INFO[0003] Killing /twet-app.2.6uueh28nxj7btpfzffeq40f6b (393355d083fbd33d8247e6cf9dcdb36046000764547db776b405bb4c37ef7438) with signal SIGKILL
You will notice that as soon as the container is killed, the swarm manager would try to restore the state back to desired state i.e. with 2 healthy replica
sh-4.2$ docker container ls | grep -i twet
75b2bf6f219d dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 42 seconds ago Up 36 seconds (healthy) 80/tcp, 443/tcp ip-10-100-2-93/twet-app.1.im7f7qm2xh6fk6uqla462qzia
sh-4.2$ docker container ls | grep -i twet
75b2bf6f219d dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 44 seconds ago Up 38 seconds (healthy) 80/tcp, 443/tcp ip-10-100-2-93/twet-app.1.im7f7qm2xh6fk6uqla462qzia
sh-4.2$
sh-4.2$ docker container ls | grep -i twet
dfb8b7ebc559 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 7 seconds ago Up 1 seconds (health: starting) 80/tcp, 443/tcp ip-10-100-2-67/twet-app.2.7505c070tcyk14wudbdbufy3t
75b2bf6f219d dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 47 seconds ago Up 41 seconds (healthy) 80/tcp, 443/tcp ip-10-100-2-93/twet-app.1.im7f7qm2xh6fk6uqla462qzia
You can also try to stop or remove a container various commands provided by pumba
.
You can also use --interval
option to run the command at a regular interval to perform stress testing. e.g. to run the same kill command every 10minutes
export SVC_NAME=twet-app
pumba --random --interval 10m kill $(docker service ps --no-trunc \
--filter "desired-state=Running" \
${SVC_NAME} | awk ' {if (NR!=1) {print $2"."$1} } ')
Let's first take example of a simple setup with a single node.
docker swarm init
Setup the service by running this command aginst the single manager node of your newly initiated Swarm Cluster
sh-4.2# docker service create -d --name=twet-app --network tweet-app-net --mode=replicated --replicas=2 --publish 8080:80 --health-cmd "python /usr/share/nginx/html/healthcheck.py || exit 1" --health-interval 10s --health-retries 2 --health-timeout 100ms dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck
image dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck could not be accessed on a registry to record
its digest. Each node will access dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck independently,
possibly leading to different nodes running different
versions of the image.
pb7teb13m2oczlp8rub0wjkdo
Fire a pumba command to introduce delays
sh-4.2# pumba --random netem --interface lo --duration 60s --tc-image gaiadocker/iproute2 delay --time 10 jitter 100 --distribution normal $(docker service ps --no-trunc \
--filter "desired-state=Running" \
${SVC_NAME} | awk ' {if (NR!=1) {print $2"."$1} } ')
INFO[0000] netem: delay for containers
INFO[0000] Running netem command '[delay 10ms 10ms 20.00]' on container 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c for 1m0s
INFO[0000] Start netem for container 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c on 'lo' with command '[delay 10ms 10ms 20.00]'
Monitor the status for containers running the of for the service:
sh-4.2# docker container ls | grep twet-app
2eb65f467edc dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." About a minute ago Up About a minute (healthy) 80/tcp, 443/tcp twet-app.2.eyxe2yo9fs928b6x2oa3q26m0
031b9ec08b50 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 11 minutes ago Up 11 minutes (healthy) 80/tcp, 443/tcp twet-app.1.k6wfqvzyn5p7ka60witz92msv
You will notice that becuase of the network delays introduced by pumba, the containers are failing the healthcheck:
sh-4.2# docker container ls | grep twet-app
2eb65f467edc dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." About a minute ago Up About a minute (unhealthy) 80/tcp, 443/tcp twet-app.2.eyxe2yo9fs928b6x2oa3q26m0
031b9ec08b50 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 11 minutes ago Up 11 minutes (healthy) 80/tcp, 443/tcp twet-app.1.k6wfqvzyn5p7ka60witz92msv
Soon the unhealthy container would be removed:
sh-4.2# docker container ls | grep twet-app
031b9ec08b50 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 11 minutes ago Up 11 minutes (healthy) 80/tcp, 443/tcp twet-app.1.k6wfqvzyn5p7ka60witz92msv
And it will be replaced with a new container:
sh-4.2# docker container ls | grep twet-app
2c1453b4d680 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 5 seconds ago Up Less than a second (health: starting) 80/tcp, 443/tcp twet-app.2.oc900fzo7a7v1kgz2kt45h1pr
031b9ec08b50 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 11 minutes ago Up 11 minutes (healthy) 80/tcp, 443/tcp twet-app.1.k6wfqvzyn5p7ka60witz92msv
As soon as the healthcheck is executed, it will turn into a healthy one:
sh-4.2# docker container ls | grep twet-app
2c1453b4d680 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 20 seconds ago Up 15 seconds (healthy) 80/tcp, 443/tcp twet-app.2.oc900fzo7a7v1kgz2kt45h1pr
031b9ec08b50 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck "nginx -g 'daemon ..." 11 minutes ago Up 11 minutes (healthy) 80/tcp, 443/tcp twet-app.1.k6wfqvzyn5p7ka60witz92msv
sh-4.2#
While the container is being replaced, you will notice that pumba command would fail (as the container it attached to has been lost)
INFO[0060] Stopping netem on container 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c
INFO[0060] Stop netem for container 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c on 'lo'
ERRO[0060] Error response from daemon: cannot join network of a non running container: 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c
ERRO[0060] Error response from daemon: cannot join network of a non running container: 2eb65f467edc585e586feee01d3eba36c301bb3830a9e163e81aa4edf8d5f36c
As you can see, pumba was able to introduce network delay and HEALTHCHECK
in the image or --health-cmd
at service level helped us to restart the images which were slowing. Well, at this time this is the most that Pumba and Swarm can do. I am hoping in times to come, Swarm service healthcheck would allow us to define auto-scale policies too.
Now, if we are running against a UCP setup or any "true" swarm cluster which has worker and manager nodes, pumba netem command would not work when you fire it from a client. This is unlike the kill command (or most of the other pumba commands), which do work against a Swarm cluster. I came up with a simple solution to work around it.
Well you can run pubma in a container as the example says on it's github page.
# once in a 10 seconds, try to kill (with `SIGTERM` signal) all containers named **hp(something)**
# on same Docker host, where Pumba container is running
$ docker run -d -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba pumba --interval 10s kill --signal SIGTERM ^hp
This means that we can create, a service that runs on each node in your Swarm cluster and executes pumba netem command. We need to change the entrypoint
of the service and mount /var/run/docker.sock
of the local node to container so that pumba can have access to docker deamon on each node.
The pumba command should essentially look for containers that belong to your service only so you need to pass a list of containers to entrypoint pumba command.
export container_list=$(docker service ps --no-trunc --filter "desired-state=Running" ${SVC_NAME} | awk ' {if (NR!=1) {print $2"."$1} } ')
The command should try to inject delay only in specific interface i.e. the one used by HEALTHCHECK
.
export netem_interface=lo
Now let's run our pumba netem service
docker service create -d --restart-condition none --mode global --name pumba-netem-delay \
--mount type=bind,source=/var/run/docker.sock,destination=/var/run/docker.sock \
--entrypoint "pumba --random netem --interface ${netem_interface} --duration 60s \
--tc-image gaiadocker/iproute2 delay \
--time 10 jitter 100 \
--distribution normal ${container_list}" \
gaiaadm/pumba
The effect will be same as the previous example we run on one node Swarm Cluster.
If you are scripting this, then introduce a delay and then cleanup the swarm service:
sleep 60
docker service rm pumba-netem-delay
To be added
One of the reasont to run your containers in a Swarm cluster is to ensure fault tolderance to node failrues. Let's try to simulate node failure and see how docker UCP manager handles it.
Let's first list various tasks of our application:
docker service ps twet-app
Output would something like below, giving you details of the number of tasks, their id and node on which they are running:
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
s8iib1ue7nrd twet-app.1 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck ip-10-100-2-67 Running Running 42 minutes ago
oiafp1o6klxx twet-app.2 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck ip-10-100-2-93 Running Running 42 minutes ago
For the purpose of our testing let's try to fail one of the nodes, let's say ip-10-100-2-67
.
Since I am running in AWS, I will find out the instance id of the server and restart. We can use docker node ls
before and after restart, to note the node status
sh-4.2$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
3du1xn000h3jz3t2fcx9lcvdl ip-10-100-2-38 Ready Active
ag12n6ejw7ztf0yqpsao4208u ip-10-100-2-115 Ready Active
awql5xr67h0jmxjllzfohqqy2 * ip-10-100-2-15 Ready Active Reachable
e2soqi2u67nfnoxop8mgfvm7a ip-10-100-2-169 Ready Active Reachable
i2fjh10bx31ij6i3q2jvzwjco ip-10-100-2-40 Ready Active Leader
lpi6z3np5vp83vatmh51d3i59 ip-10-100-2-106 Ready Active
m4j4g27conj199uciw98k5h1b ip-10-100-2-67 Ready Active
mzwnamamkze7yagqa602tmd71 ip-10-100-2-93 Ready Active
o3z2xxqo90mm4dlnpq632zorj ip-10-100-2-66 Ready Active
yczj5bg55l37xfkugkwwyc5ji ip-10-100-2-70.ap-southeast-1.compute.internal Ready Active
sh-4.2$ aws ec2 describe-instances --filters "Name=network-interface.private-dns-name,Values=ip-10-100-2-67.ap-southeast-1.compute.internal" | grep -i InstanceId
"InstanceId": "i-0db2edf9253157f97",
sh-4.2$ aws ec2 reboot-instances --instance-ids i-0db2edf9253157f97
sh-4.2$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
3du1xn000h3jz3t2fcx9lcvdl ip-10-100-2-38 Ready Active
ag12n6ejw7ztf0yqpsao4208u ip-10-100-2-115 Ready Active
awql5xr67h0jmxjllzfohqqy2 ip-10-100-2-15 Ready Active Reachable
e2soqi2u67nfnoxop8mgfvm7a * ip-10-100-2-169 Ready Active Reachable
i2fjh10bx31ij6i3q2jvzwjco ip-10-100-2-40 Ready Active Leader
lpi6z3np5vp83vatmh51d3i59 ip-10-100-2-106 Ready Active
m4j4g27conj199uciw98k5h1b ip-10-100-2-67 Down Active
mzwnamamkze7yagqa602tmd71 ip-10-100-2-93 Ready Active
o3z2xxqo90mm4dlnpq632zorj ip-10-100-2-66 Ready Active
yczj5bg55l37xfkugkwwyc5ji ip-10-100-2-70.ap-southeast-1.compute.internal Ready Active
sh-4.2$
As you can see the node became unavailable once the reboot was executed
In order to maintain the desired state of service with 2 replica, Swarm manager would start a new container on one of the surviving nodes
sh-4.2$ docker service ps twet-app
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
7e2icqhk0n54 twet-app.1 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck ip-10-100-2-93 Running Running 12 seconds ago
s8iib1ue7nrd \_ twet-app.1 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck m4j4g27conj199uciw98k5h1b Shutdown Running 50 seconds ago
oiafp1o6klxx twet-app.2 dtr.ashnikdemo.com:12443/development/tweet_to_us:demoMay_Healthcheck ip-10-100-2-93 Running