Metaswitch/clearwater-docker

clearwater-cluster-manager doesn't restart Cassandra under Docker

Closed this issue · 13 comments

mirw commented

Symptoms

Spin up a deployment under Docker using an etcd-using version (i.e. commit later than ae892d7).

Cassandra doesn't start, so the deployment is not functional.

Killing Cassandra (using pkill -f cassandra) and then restarting clearwater-infrastructure (using /etc/init.d/clearwater-infrastructure restart) on both Homestead and Homer seems to resolve the problem.
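
For reference, the work-around amounts to running the following on both Homestead and Homer:

# Kill the stuck Cassandra process...
pkill -f cassandra
# ...then restart clearwater-infrastructure:
/etc/init.d/clearwater-infrastructure restart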

Impact

The deployment is not functional.

Release and environment

Seen on release-98.

Steps to reproduce

Simply start up a deployment using an etcd-using version.

mirw commented

I think the issue here is that /usr/share/clearwater/clearwater-cluster-manager/plugins/cassandra_plugin.py tries to use start-stop-daemon with a PID to stop Cassandra, but this doesn't work under Docker (as we don't save off the PID).
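
Roughly speaking, the difference is this (a sketch only - the pidfile path here is an assumption for illustration, not the plugin's actual configuration):

# What the plugin effectively attempts: stop Cassandra via a saved PID.
# Under Docker we never write that PID out, so there is nothing to match.
start-stop-daemon --stop --pidfile /var/run/cassandra.pid

# Stopping by process name (as the manual work-around does) works regardless:
pkill -f cassandra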

mirw commented

Actually, the work-around above (of killing Cassandra and restarting clearwater-infrastructure) doesn't seem 100% reliable - it worked on Homestead, but not on Homer.

mirw commented

It sometimes seems to be necessary to kill Cassandra multiple times before the work-around takes effect - I'm not sure why.

I've also noticed that stopping the deployment doesn't seem to work cleanly.

Why is live-test-docker working reliably if we have this bug?

mirw commented

Not sure, but it's not just me that's hit this - it's also been hit on the mailing list (http://lists.projectclearwater.org/pipermail/clearwater_lists.projectclearwater.org/2016-May/002954.html).

Agree that we should investigate how live-test-docker differs from our documented install process as part of resolving this. (I notice that it does pass different parameters to docker, although I can't immediately see how these would be significant.)

I've successfully turned up a Docker system and made a call through it. I've used the latest versions of Docker and Compose to rule out the possibility that there's some recent regression. In other words, I can't repro this.

It also looks like the mailing list user has now got Docker working (http://lists.projectclearwater.org/pipermail/clearwater_lists.projectclearwater.org/2016-May/002963.html).

One issue I did hit, which might explain the issues, is that if I stop the Docker deployment and start it again, the containers were assigned different IP addresses. We don't automatically spot IP address changes and reconfigure our databases, so changing the IP address means that Cassandra can't start (because cassandra.yaml still has the old IP). (This is not specific to Docker - e.g. see https://github.com/Metaswitch/clearwater-etcd/issues/287)
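
A quick way to spot this state is to compare the address baked into cassandra.yaml with the container's current address (listen_address/rpc_address are the standard Cassandra settings; which ones your deployment sets may vary):

# Address Cassandra was configured with when the container was created:
grep -E 'listen_address|rpc_address' /etc/cassandra/cassandra.yaml
# Address the container actually has now:
hostname -I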

I'm pretty sure (having checked with @graemerobertson) that we've never had a documented procedure for changing Clearwater's IP addresses, let alone an automated one - I'll make sure the product owner's aware of this, but it feels like new function rather than a bugfix.

sudo docker-compose -f minimal-distributed.yaml up --force-recreate works (because it recreates the instances instead of just restarting them).
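
In other words (a sketch of the two sequences against the same compose file):

# Stopping and starting the existing containers can hand them new IPs that no
# longer match the addresses already written into cassandra.yaml:
sudo docker-compose -f minimal-distributed.yaml stop
sudo docker-compose -f minimal-distributed.yaml start

# Recreating the containers from scratch avoids the stale-IP problem:
sudo docker-compose -f minimal-distributed.yaml up --force-recreate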

@mirw , is it possible you stopped and then started your Docker containers?

mirw commented

I don't think so, and I've just reproduced it. I'll send you the details of the box on which I was testing.

Hmm, interesting. On your machine, ps shows this:

root@0e2e864ca5a8:/# ps axo pid,ppid,cmd,etime; date
  PID  PPID CMD                             ELAPSED
    1     0 /usr/bin/python /usr/bin/su    02:40:32
    7     1 /usr/sbin/sshd -D              02:40:30
  207     1 /usr/bin/etcd --listen-clie    02:40:28
  234     1 /usr/share/clearwater/clear    02:40:26
  238     1 /bin/sh /etc/init.d/homeste    02:40:26
  239     1 nginx: master process /usr/    02:40:26
  242   238 /usr/share/clearwater/crest    02:40:26
  243   239 nginx: worker process          02:40:26
  244   239 nginx: worker process          02:40:26
  245   239 nginx: worker process          02:40:26
  246   239 nginx: worker process          02:40:26
  273     1 /bin/sh /usr/sbin/cassandra    02:40:25
  276   273 [java] <defunct>               02:40:25
 7350     0 /bin/bash                      01:10:11
25754   234 /bin/sh -c /usr/share/clear    02:06:49
25755 25754 /bin/bash /usr/share/clearw    02:06:49
25765 25755 /bin/sh /usr/bin/nodetool e    02:06:49
25768 25765 [java] <defunct>               02:06:49
26967  7350 ps axo pid,ppid,cmd,etime         00:00
Mon Jun  6 20:05:42 UTC 2016

So:

  • /usr/sbin/cassandra started at 17:25:17
  • it spawned a Java process, which is now defunct

Looking at the cluster manager log, 17:25:17 is too early for Cassandra to have started successfully - the cluster manager itself only started at that time, and didn't put a cassandra.yaml file into place until 17:25:54. But I'd expect /usr/sbin/cassandra to keep failing and restarting until cassandra.yaml was in place.
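
(For the record, 17:25:17 is just the 20:05:42 timestamp minus the 02:40:25 ELAPSED from the ps output above; assuming GNU date, the arithmetic can be checked like this:)

# 20:05:42 minus 02:40:25 gives the start time of PID 273:
date -u -d "2016-06-06 20:05:42 UTC - 2 hours - 40 minutes - 25 seconds"
# Mon Jun  6 17:25:17 UTC 2016
# ps can also report it directly:
ps -o lstart= -p 273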

Attaching with strace, I get:

root@0e2e864ca5a8:/# strace -p 273                                                                
Process 273 attached
wait4(-1, 

So it's waiting for a child process - the -1 first argument means "wait for any child process" - but its only child is already defunct (i.e. a zombie), which a pending wait() should reap immediately. Why isn't that happening?
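
One way to double-check that the child really is in zombie state (a standard /proc check, nothing Clearwater-specific):

# A defunct process reports state Z until its parent reaps it:
grep ^State /proc/276/status
# State:  Z (zombie)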

gdb is unhelpful - beyond the wait4 frame, the backtrace is mostly unresolved frames with no symbols:

(gdb) bt
#0  0x00007f5c50390aba in wait4 () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f5c508c5e0c in ?? ()
#2  0x00007f5c508c749d in ?? ()
#3  0x00007f5c508c4da0 in ?? ()
#4  0x00007f5c508c4efb in ?? ()
#5  0x00007f5c508c1595 in ?? ()
#6  0x00007f5c508c0746 in ?? ()
#7  0x00007f5c508c791e in ?? ()
#8  0x00007f5c508c7a78 in ?? ()
#9  0x00007f5c508c1292 in ?? ()
#10 0x00007f5c508c17df in ?? ()
#11 0x00007f5c508c0746 in ?? ()
#12 0x00007f5c508c0746 in ?? ()
#13 0x00007f5c508c791e in ?? ()
#14 0x00007f5c508bed30 in ?? ()
#15 0x00007f5c502f1f45 in __libc_start_main (main=0x7f5c508bec60, argc=3, argv=0x7ffd9268e648, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7ffd9268e638) at libc-start.c:287
#16 0x00007f5c508bee5b in ?? ()

Kernel versions are different on a system hitting this bug (3.13.0-74-generic) and on two systems not hitting this bug (3.13.0-57-generic on docker-staging, 3.13.0-83-generic on my dev box). I can't find any relevant-looking bug reports at https://launchpad.net/ubuntu/+source/linux/+bugs, though.

Tomorrow I might try to repro this on a 3.13.0-74-generic system, then upgrade it to 3.13.0-83-generic and see if that fixes it.
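
For anyone trying to correlate this, checking which kernel a box is actually running (and which kernels it has installed) is straightforward:

# Kernel currently running:
uname -r
# Kernel packages installed (installed ones show as "ii"):
dpkg -l 'linux-image-*' | grep ^ii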

Setting up from scratch with the Ubuntu Trusty AMI, ami-fce3c696, which uses 3.13.0-74-generic by default, shows the same symptoms:

root@a4a6c391efb1:~# ls -l /etc/cassandra/cassandra.yaml
-rw-r--r-- 1 root root 2696 Jun  7 08:06 /etc/cassandra/cassandra.yaml
root@a4a6c391efb1:~# ps axo pid,ppid,cmd,etime; date
  PID  PPID CMD                             ELAPSED
    1     0 /usr/bin/python /usr/bin/su       04:16
    9     1 /usr/sbin/sshd -D                 04:14
  209     1 /usr/bin/etcd --listen-clie       04:11
  237     1 /usr/share/clearwater/clear       04:10
  240     1 /bin/sh /etc/init.d/homeste       04:10
  241     1 nginx: master process /usr/       04:10
  244   240 /usr/share/clearwater/crest       04:09
  246   241 nginx: worker process             04:09
  247   241 nginx: worker process             04:09
  248   241 nginx: worker process             04:09
  249   241 nginx: worker process             04:09
  266     1 /bin/sh /usr/sbin/cassandra       04:09
  294   266 [java] <defunct>                  04:09
  404     9 sshd: root@pts/0                  03:51
  432   404 -bash                             03:46
 6871   432 ps axo pid,ppid,cmd,etime         00:00
Tue Jun  7 08:10:02 UTC 2016

Now to try with a higher kernel version.

OK, I have:

  • started a 3.13.0-74-generic VM, set up clearwater-docker, and confirmed that repros the problem
  • upgraded that machine to 3.13.0-87-generic (installing Docker installs the latest kernel, so I just rebooted), rebuilt the Docker images, and confirmed that the problem doesn't repro (I don't get a defunct Java process, I get a running Java/Cassandra process)
  • uninstalled that kernel (sudo apt-get remove linux-image-3.13.0-87-generic) and rebooted, putting me back on 3.13.0-74-generic, rebuilt the Docker images, and confirmed that the problem still repros (i.e. it's specifically related to the kernel, not to rebooting)
  • started a new VM, upgraded to 3.13.0-87-generic and rebooted before doing anything Docker-related, then set up clearwater-docker and confirmed the problem doesn't repro

I think that's pretty conclusive that this is an issue with the 3.13.0-74-generic kernel. I don't think it's worth pursuing this upstream, as a newer kernel that fixes the issue has been released - but I'll add a note to the README suggesting you upgrade kernels if you see this problem.
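
The README note will presumably boil down to something like this (a sketch for Ubuntu 14.04; linux-image-generic just pulls in whatever the current generic kernel is):

# If uname -r reports 3.13.0-74-generic, pull in a newer kernel and reboot:
sudo apt-get update
sudo apt-get install linux-image-generic
sudo reboot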

#25 adds clear advice not to use that kernel.