Building and install images:
# ON LOCAL MACHINE:
# * build images and docker-compose.yml file
$ ./setup.sh MYNAME
Start up testkit and promote 2 managers:
# ON LOCAL MACHINE:
# * create a cluster with 5 macines: 3 managers and 2 workers
$ testkit create 5 0 --parallel --engine ee-2.1 --name=tk1706 --debug
$ testkit machine ssh tk1706-ubuntu-0 docker node promote tk1706-ubuntu-1
$ testkit machine ssh tk1706-ubuntu-0 docker node promote tk1706-ubuntu-2
Log into testkit:
# ON LOCAL MACHINE:
$ testkit machine ssh tk1706-ubuntu-0
Clone repo for compose and start service:
# ON LOCAL MACHINE:
# * start the test service to run during upgrade
$ (eval "$(testkit machine env tk1706-ubuntu-0)" &&
docker stack deploy -c docker-compose.yml test)
Put a tarball of upgraded moby binaries on every node:
# ON LOCAL MACHINE:
# * build the moby binaries
# (can get/upgrade this binary by some other means if desired;
# for example by debian packages. If so, adjust upgrade steps
# below.)
$ git clone https://github.com/moby/moby
$ cd moby
$ make
# * Create a tarball of binary images
$ cd bundles/binary-daemon/
$ tar -h -zcvf /tmp/engine-binaries.tgz docker-containerd docker-containerd-ctr docker-containerd-shim dockerd docker-init docker-proxy docker-runc
# * Copy the tarball to the nodes in preparation for upgrade
# + get the node IPs
$ testkit node ls
# + do this for each node IP address
$ scp -i ~/.testkit/testkit.pem /tmp/engine-binaries.tgz docker@A.B.C.D:
Upgrading a manager:
# ON ANY MANAGER:
# * drain containers from the manager
$ docker node update --availability drain $(docker node ls | grep NODENAME | awk '{print $1}')
# ON THE MANAGER TO UPGRADE:
# * verify that containers have all stopped
$ docker container ls
# *** Steps from here could differ by OS as needed. ***
# *** This is just the procedure that I used. ***
# * stop the docker service
$ sudo service docker stop
$ cd /usr/bin/
# * backup the old binaries
$ tar zcvf /home/docker/old-engine-binaries.tgz docker-containerd docker-containerd-ctr docker-containerd-shim dockerd docker-init docker-proxy docker-runc
# * install the new ones
$ sudo tar zxvf /home/docker/engine-binaries.tgz
# * restart the service
$ sudo service docker start
One must upgrade all managers before re-activating them. After upgrading them all, reactivate them to receive workloads:
# ON ANY MANAGER:
# * do this for each of the manager node names
$ docker node update --availability active $(docker node ls | grep NODENAME | awk '{print $1}')
# * check that services have successfully migrated
$ docker service ps test_service
$ docker service ps test_client
Upgrading the nodes:
# Follow the same upgrade procedure as a manager as above. But there
# is no need to drain them *all* before restarting them. One can make them
# active as soon as they are upgraded.
Check for connectivity failures:
# on local machine:
# * verify that output is in /var/lib/docker/volumes/test_cliout/_data/client.log
# Look for: "Mountpoint": "/var/lib/docker/volumes/test_cliout/_data",
# (if not, adjust path for second command)
$ testkit machine ssh tk1706-ubuntu-0 docker volume inspect test_cliout
# * cat all the output logs to see if there were any "Connectivity error:"s
$ for i in 0 1 2 3 4 ; do testkit machine ssh tk1706-ubuntu-$i sudo cat /var/lib/docker/volumes/test_cliout/_data/client.log ; done
Output after cat-ing the voumes the output should look as follows. The number of lines may vary depending on how swarm places containers in the overall network.
Fri Aug 24 15:49:38 UTC 2018: Starting client service
Fri Aug 24 15:51:15 UTC 2018: Starting client service
Fri Aug 24 16:01:09 UTC 2018: Starting client service
Fri Aug 24 15:54:21 UTC 2018: Starting client service
Fri Aug 24 15:59:26 UTC 2018: Starting client service
Fri Aug 24 15:55:28 UTC 2018: Starting client service
Fri Aug 24 15:49:38 UTC 2018: Starting client service
The output of docker service ps test_client
and
docker service ps test_serivce
should look similarly uninteresting:
docker@tk1706-ubuntu-0:~$ docker service ps test_client
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
dsfza4hcu4gp test_client.1 ctelfer/lb-upgrade-test-cli:latest tk1706-ubuntu-2 Running Running 2 minutes ago
xjnra1dwfttg \_ test_client.1 ctelfer/lb-upgrade-test-cli:latest tk1706-ubuntu-3 Shutdown Shutdown 2 minutes ago
o58t2inc3tka \_ test_client.1 ctelfer/lb-upgrade-test-cli:latest tk1706-ubuntu-2 Shutdown Shutdown 6 minutes ago
un3vbh1iv2pb \_ test_client.1 ctelfer/lb-upgrade-test-cli:latest tk1706-ubuntu-1 Shutdown Shutdown 7 minutes ago
5gp5gbt61sh9 \_ test_client.1 ctelfer/lb-upgrade-test-cli:latest tk1706-ubuntu-0 Shutdown Shutdown 11 minutes ago
ibg27ylftdgv test_client.2 ctelfer/lb-upgrade-test-cli:latest tk1706-ubuntu-1 Running Running about a minute ago
tnw3gxifz0ef \_ test_client.2 ctelfer/lb-upgrade-test-cli:latest tk1706-ubuntu-4 Shutdown Shutdown about a minute ago
docker@tk1706-ubuntu-0:~$ docker service ps test_service
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
n28ip529ru04 test_service.w232ljdni54ibvlsgewehon99 nginx:latest tk1706-ubuntu-4 Running Running 30 seconds ago
7uffy3trubyq test_service.z5tqpoosc7akmy1rooonzyqk2 nginx:latest tk1706-ubuntu-3 Running Running 2 minutes ago
qfj700xptc0d test_service.bkir1i90n88pf3n30ftx32yww nginx:latest tk1706-ubuntu-2 Running Running 5 minutes ago
d64vq365ehi2 test_service.rgwd7c9f151elfy4io9es7983 nginx:latest tk1706-ubuntu-1 Running Running 5 minutes ago
jnk6579o00si test_service.t7felwp30mvi4n910z2y3u1cu nginx:latest tk1706-ubuntu-0 Running Running 6 minutes ago
x1w1vpvlwhru test_service.w232ljdni54ibvlsgewehon99 nginx:latest tk1706-ubuntu-4 Shutdown Shutdown about a minute ago
rxktf2n2jqbb test_service.z5tqpoosc7akmy1rooonzyqk2 nginx:latest tk1706-ubuntu-3 Shutdown Shutdown 3 minutes ago
mklb6o6mcyab test_service.bkir1i90n88pf3n30ftx32yww nginx:latest tk1706-ubuntu-2 Shutdown Shutdown 6 minutes ago
m4umkff8i5w0 test_service.rgwd7c9f151elfy4io9es7983 nginx:latest tk1706-ubuntu-1 Shutdown Shutdown 8 minutes ago
z59srrwswmv8 test_service.t7felwp30mvi4n910z2y3u1cu nginx:latest tk1706-ubuntu-0 Shutdown Shutdown 11 minutes ago
More generic error lister with tests:
for i in 0 1 2 3 4
do
echo tk1706-ubuntu-$i
testkit machine ssh tk1706-ubuntu-$i \
if docker volume inspect test_cliout \> /dev/null 2\>\&1 \; then \
sudo cat /var/lib/docker/volumes/test_cliout/_data/client.log \; \
fi
done
Error seen if activating a manager before all the managers are upgraded. This same error will be seen if the workers are upgraded before the managers:
$ docker@tk1706-ubuntu-0:~$ docker service ps test_service
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
134tseerb5ay test_service.rflm00eupnjjkrpt9ot2l93eh nginx:latest tk1706-ubuntu-0 Shutdown Rejected about a minute ago "node is missing network attac…"
tuddp1ycbf38 \_ test_service.rflm00eupnjjkrpt9ot2l93eh nginx:latest tk1706-ubuntu-0 Shutdown Rejected about a minute ago "node is missing network attac…"
...
$ docker@tk1706-ubuntu-0:~$ docker inspect 134tseerb5ay
[
...
"Status": {
"Timestamp": "2018-08-23T21:18:24.469418288Z",
"State": "rejected",
"Message": "preparing",
"Err": "node is missing network attachments, ip addresses may be exhausted",
"ContainerStatus": {
"ContainerID": "",
"PID": 0,
"ExitCode": 0
},
"PortStatus": {}
},
...
]
Error seen when leaving a node as active and performing a service stop and then service start (without upgrade):
$ docker service ps test_service
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
nhk7xvotn8w8 test_service.s356gfdm8y01jtk4t8w9pvjn3 nginx:latest tk1706-ubuntu-0 Running Running 59 seconds ago
er7zhjrz3yfj \_ test_service.s356gfdm8y01jtk4t8w9pvjn3 nginx:latest tk1706-ubuntu-0 Shutdown Rejected about a minute ago "cannot create a swarm scoped …"
vxhpsu9msljq \_ test_service.s356gfdm8y01jtk4t8w9pvjn3 nginx:latest tk1706-ubuntu-0 Shutdown Rejected about a minute ago "cannot create a swarm scoped …"
s63cina9dbrb test_service.mrlgc2khq4t6dwwa6td15yjqh nginx:latest tk1706-ubuntu-1 Running Running 3 minutes ago
hib6okicy3bt test_service.s356gfdm8y01jtk4t8w9pvjn3 nginx:latest tk1706-ubuntu-0 Shutdown Shutdown about a minute ago
r3d4fwnjt41f test_service.wx0vsiypkncc33nko29ygqbzk nginx:latest tk1706-ubuntu-4 Running Running 3 minutes ago
3sf9rw14etsw test_service.qvic6463l4uog7iw9zid98avs nginx:latest tk1706-ubuntu-3 Running Running 3 minutes ago
n5roh3wwpllq test_service.q74h32eah8ewm8e2cn55d1hdm nginx:latest tk1706-ubuntu-2 Running Running 3 minutes ago
$ docker inspect er7zhjrz3yfj
...
"Status": {
"Timestamp": "2018-08-24T20:26:41.844571484Z",
"State": "rejected",
"Message": "preparing",
"Err": "cannot create a swarm scoped network when swarm is not active",
"ContainerStatus": {},
"PortStatus": {}
},
...
The following signatures can be seen when a network which was near capacity exceeds capacity following the upgrade. To trigger this condition one can do the following:
-
create the test cluster as described above
-
in
exhaust_addrs/
run `$ (eval "$(testkit machine env tk1706-ubuntu-0)" && docker stack deploy -c docker-compose.yml ex)
and then
(eval "$(testkit machine env tk1706-ubuntu-0)" && docker service ls && docker service ps ex_service )
This puts a 24-container service on a 5-node cluster in an overlay network with
32 - 3 - 1 - 5 = 23
available IP addresses. The error looks like this:
ID NAME MODE REPLICAS IMAGE PORTS
wn3x4lu9cnln ex_service replicated 19/24 nginx:latest
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
xdevhaoovqew ex_service.1 nginx:latest tk1706-ubuntu-1 Running Preparing 27 seconds ago
q9fc83to7n8p ex_service.2 nginx:latest tk1706-ubuntu-4 Running Running 25 seconds ago
p5h3gbxct2cn ex_service.3 nginx:latest tk1706-ubuntu-2 Running Running 25 seconds ago
fs2380pgiy09 ex_service.4 nginx:latest tk1706-ubuntu-3 Running Running 25 seconds ago
vht1n02bu076 ex_service.5 nginx:latest tk1706-ubuntu-2 Running Running 25 seconds ago
ac4rqb92ln5r ex_service.6 nginx:latest tk1706-ubuntu-0 Running Running 25 seconds ago
ndopmbh4odwp ex_service.7 nginx:latest tk1706-ubuntu-0 Running Running 25 seconds ago
15v7de4wh86i ex_service.8 nginx:latest tk1706-ubuntu-3 Running Running 25 seconds ago
n5qe7b01rims ex_service.9 nginx:latest tk1706-ubuntu-3 Running Running 24 seconds ago
qwg2kgifx997 ex_service.10 nginx:latest tk1706-ubuntu-0 Running Running 25 seconds ago
i64lee19ia6s ex_service.11 nginx:latest tk1706-ubuntu-1 Running Preparing 27 seconds ago
mlngcwkok7lv ex_service.12 nginx:latest tk1706-ubuntu-4 Running Running 24 seconds ago
lvgu2chww41a ex_service.13 nginx:latest tk1706-ubuntu-2 Running Running 25 seconds ago
v1oris8jvhc1 ex_service.14 nginx:latest tk1706-ubuntu-1 Running Preparing 27 seconds ago
fzadkml9rx6s ex_service.15 nginx:latest tk1706-ubuntu-2 Running Running 25 seconds ago
ltwiom799rjz ex_service.16 nginx:latest tk1706-ubuntu-1 Running Preparing 27 seconds ago
rbewma4plt95 ex_service.17 nginx:latest tk1706-ubuntu-4 Running Running 25 seconds ago
vbthfd5pq9hc ex_service.18 nginx:latest tk1706-ubuntu-3 Running Running 25 seconds ago
58oo545x10ud ex_service.19 nginx:latest tk1706-ubuntu-3 Running Running 24 seconds ago
iz4dzpalegez ex_service.20 nginx:latest tk1706-ubuntu-0 Running Running 25 seconds ago
j3u7qsb6uyfe ex_service.21 nginx:latest tk1706-ubuntu-2 Running Running 25 seconds ago
zjcg1swgcvin ex_service.22 nginx:latest tk1706-ubuntu-4 Running Running 25 seconds ago
yp8cjae6xh7g ex_service.23 nginx:latest tk1706-ubuntu-0 Running Running 25 seconds ago
j53nz0hasu0e ex_service.24 nginx:latest tk1706-ubuntu-1 Running Preparing 27 seconds ago
Eventually it can settle to something like this:
ID NAME MODE REPLICAS IMAGE PORTS
wn3x4lu9cnln ex_service replicated 24/24 nginx:latest
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
2w6uzqbqtxff ex_service.1 nginx:latest tk1706-ubuntu-3 Running Running 7 minutes ago
xdevhaoovqew \_ ex_service.1 nginx:latest tk1706-ubuntu-1 Shutdown Rejected 7 minutes ago "node is missing network attac…"
q9fc83to7n8p ex_service.2 nginx:latest tk1706-ubuntu-4 Running Running 8 minutes ago
p5h3gbxct2cn ex_service.3 nginx:latest tk1706-ubuntu-2 Running Running 8 minutes ago
fs2380pgiy09 ex_service.4 nginx:latest tk1706-ubuntu-3 Running Running 8 minutes ago
vht1n02bu076 ex_service.5 nginx:latest tk1706-ubuntu-2 Running Running 8 minutes ago
ac4rqb92ln5r ex_service.6 nginx:latest tk1706-ubuntu-0 Running Running 8 minutes ago
ndopmbh4odwp ex_service.7 nginx:latest tk1706-ubuntu-0 Running Running 8 minutes ago
15v7de4wh86i ex_service.8 nginx:latest tk1706-ubuntu-3 Running Running 8 minutes ago
n5qe7b01rims ex_service.9 nginx:latest tk1706-ubuntu-3 Running Running 8 minutes ago
qwg2kgifx997 ex_service.10 nginx:latest tk1706-ubuntu-0 Running Running 8 minutes ago
kah5rzm81dbs ex_service.11 nginx:latest tk1706-ubuntu-4 Running Running 7 minutes ago
i64lee19ia6s \_ ex_service.11 nginx:latest tk1706-ubuntu-1 Shutdown Rejected 7 minutes ago "node is missing network attac…"
mlngcwkok7lv ex_service.12 nginx:latest tk1706-ubuntu-4 Running Running 8 minutes ago
lvgu2chww41a ex_service.13 nginx:latest tk1706-ubuntu-2 Running Running 8 minutes ago
pv88mx99lq3x ex_service.14 nginx:latest tk1706-ubuntu-0 Running Running 7 minutes ago
v1oris8jvhc1 \_ ex_service.14 nginx:latest tk1706-ubuntu-1 Shutdown Rejected 7 minutes ago "node is missing network attac…"
fzadkml9rx6s ex_service.15 nginx:latest tk1706-ubuntu-2 Running Running 8 minutes ago
i3o8qmuwf8ls ex_service.16 nginx:latest tk1706-ubuntu-2 Running Running 7 minutes ago
ltwiom799rjz \_ ex_service.16 nginx:latest tk1706-ubuntu-1 Shutdown Rejected 7 minutes ago "node is missing network attac…"
rbewma4plt95 ex_service.17 nginx:latest tk1706-ubuntu-4 Running Running 8 minutes ago
vbthfd5pq9hc ex_service.18 nginx:latest tk1706-ubuntu-3 Running Running 8 minutes ago
58oo545x10ud ex_service.19 nginx:latest tk1706-ubuntu-3 Running Running 8 minutes ago
iz4dzpalegez ex_service.20 nginx:latest tk1706-ubuntu-0 Running Running 8 minutes ago
j3u7qsb6uyfe ex_service.21 nginx:latest tk1706-ubuntu-2 Running Running 8 minutes ago
zjcg1swgcvin ex_service.22 nginx:latest tk1706-ubuntu-4 Running Running 8 minutes ago
yp8cjae6xh7g ex_service.23 nginx:latest tk1706-ubuntu-0 Running Running 8 minutes ago
xb4lzjrkgadk ex_service.24 nginx:latest tk1706-ubuntu-4 Running Running 7 minutes ago
j53nz0hasu0e \_ ex_service.24 nginx:latest tk1706-ubuntu-1 Shutdown Rejected 7 minutes ago "node is missing network attac…"
Looking carefully at the output one will notice that there are no Running
tasks on tk1706-ubuntu-1. This is because the rejected tasks were eventually
able to be assigned to nodes that already had load balancing IP addresses.
This is because with one node without assigned tasks, the total available IP
addresses becomes 32 - 3 - 1 - 4 = 24
. This is not guaranteed to occur
because whether it does depends upon the order that swarm assigns IP addresses
to nodes and to tasks.
Inspecting one of the failed tasks such as by doing the following on one of the manager nodes:
$ docker inspect i64lee19ia6s
will yield output like:
...
"Status": {
"Timestamp": "2018-08-24T21:03:37.885405884Z",
"State": "rejected",
"Message": "preparing",
"Err": "node is missing network attachments, ip addresses may be exhausted",
"ContainerStatus": {
"ContainerID": "",
"PID": 0,
"ExitCode": 0
},
"PortStatus": {}
},
...
(Note that i64lee19ia6s
was determined from the output of docker serivce ps ...
above.)