Docker daemon dies on boot when cgroupsPath used
ijc opened this issue · 13 comments
After #14 Docker does not start. Shortly after boot, without taking any other action, the state becomes:
(ns: getty) linuxkit-d6247f201b73:~# ctr t ls
TASK PID STATUS
rngd 808 STOPPED
sshd 851 RUNNING
docker 547 STOPPED
getty 586 RUNNING
kubelet 644 RUNNING
kubernetes-docker-image-cache-common 678 STOPPED
kubernetes-docker-image-cache-control-plane 730 STOPPED
ntpd 771 RUNNING
The logs are uninteresting:
(ns: getty) linuxkit-d6247f201b73:~# cat /var/log/docker.err.log
[WARN tini (547)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
time="2017-12-08T14:55:34.404476201Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
time="2017-12-08T14:55:34.405395092Z" level=info msg="libcontainerd: new containerd process, pid: 591"
time="2017-12-08T14:55:36.515729941Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
time="2017-12-08T14:55:36.557961200Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
time="2017-12-08T14:55:36.558552846Z" level=info msg="Loading containers: start."
time="2017-12-08T14:55:36.717274716Z" level=warning msg="Running modprobe nf_nat failed with message: `ip: can't find device 'nf_nat'\nmodprobe: module nf_nat not found in modules.dep`, error: exit status 1"
time="2017-12-08T14:55:36.731672396Z" level=warning msg="Running modprobe xt_conntrack failed with message: `ip: can't find device 'xt_conntrack'\nmodprobe: module xt_conntrack not found in modules.dep`, error: exit status 1"
time="2017-12-08T14:55:37.065878091Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
time="2017-12-08T14:55:37.171766223Z" level=info msg="Loading containers: done."
time="2017-12-08T14:55:37.274945197Z" level=info msg="Docker daemon" commit=f4ffd25 graphdriver(s)=overlay2 version=17.10.0-ce
time="2017-12-08T14:55:37.275544195Z" level=info msg="Daemon has completed initialization"
time="2017-12-08T14:55:37.288354509Z" level=info msg="API listen on /var/run/docker.sock"
It seems that reverting #14 fixes things. I'll double check and raise a PR to revert while we sort this out.
/cc @justincormack. This was also mentioned in #11 (comment).
#14 says the change was based on https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources but I don't see any reference to podruntime or systemreserved in the linked document. It doesn't seem to go into that sort of detail; it's about a kubelet flag that we don't use.
Looks like removing the cgroupsPath from the docker and kubelet services, while leaving the others in place, is sufficient to fix the problem.
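A rough way to confirm where each service's processes end up after editing the yml (a sketch, assuming the ctr task listing format shown above) is to compare their /proc/<pid>/cgroup entries:

# For each service of interest, look up its PID from ctr and print its cgroups.
for svc in docker kubelet kubernetes-docker-image-cache-common; do
  pid=$(ctr t ls | awk -v s="$svc" '$1 == s { print $2 }')
  echo "== $svc (pid $pid) =="
  cat "/proc/$pid/cgroup"
done

Services with no cgroupsPath set should show up under their own per-container path rather than the shared one.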
I had thought this was correlated with kubeadm being run, but actually it is something to do with the image cache loading: if I don't run kubeadm it still dies a little while after boot. However, if I remove the image caches from the image then it seems kubeadm can be run and things come up OK.
I added -D --log-level=debug to the daemon command line, producing docker.txt. I don't see anything interesting in there.
It's a bit hard to be sure, but using strace it looks like the dockerd process is being sent a SIGKILL from somewhere -- I can't see from where, though.
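Roughly, something like this (a sketch, not necessarily the exact invocation used) only shows that dockerd died from the signal, not which process sent it:

# Attach to the running daemon and follow children; when it dies strace
# reports the fatal signal (e.g. "+++ killed by SIGKILL +++") but not the sender.
strace -f -p "$(pidof dockerd)" -e trace=signal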
Putting docker and kubernetes-docker-image-cache-{control-plane,common} into three separate cgroups seems to resolve the issue.
I wonder if, between them, they are blowing through some sort of cgroup memory limit when they all share the same cgroup, and triggering something like an OOM kill? I'm seeing no sign of that in dmesg, though.
That's really odd. The cgroups change just creates the groups; no limits are applied. Hmm.
I wondered whether there was some unhelpful default in the kernel (e.g. 50% of RAM), which seems a little unlikely, or whether something (kubelet itself, maybe?) was taking it upon itself to apply caps to existing cgroups. Perhaps we need to pass kubelet one or more of its --cgroup-* options, which can reserve resources for system cgroups, to tell it which cgroup is which?
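One way to rule out an accidental limit (a sketch; the exact path depends on the cgroupsPath value used in #14) would be to read the memory controller's files for the shared group directly:

# Substitute the shared cgroupsPath from the yml for <shared-path>.
cat /sys/fs/cgroup/memory/<shared-path>/memory.limit_in_bytes     # applied limit (a huge number means unlimited)
cat /sys/fs/cgroup/memory/<shared-path>/memory.max_usage_in_bytes
dmesg | grep -iE 'oom|memory cgroup'                              # any OOM-killer activity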
It can't be kubelet, because this happens even when that service is still sitting in our kubelet.sh wrapper waiting for /etc/kubernetes/kubelet.conf to be populated by kubeadm.
I'm playing with auditd, trying to get a handle on which process is doing the killing.
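Something along these lines should catch the sender (a sketch; the key names are arbitrary): log every kill(2) that delivers SIGKILL plus every execve, then work backwards from the audit records after docker dies.

# Record every kill(2) delivering signal 9, and every execve, so the killer's
# PID can be matched back to the command that ran it.
auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k sigkill-hunt
auditctl -a always,exit -F arch=b64 -S execve -k exec-hunt

# After dockerd has died:
ausearch -k sigkill-hunt
ausearch -k exec-hunt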
I'm not 100% sure yet, but it looks like it is a runc process that is doing it (I'm now trying to trace fork+exec and friends to reconstruct the process tree). I wonder if, when the image cache services exit, it is killing everything in the same cgroup?
The process which is killing docker-init, dockerd and docker-containerd, as well as all the subprocesses of the control-plane image cache service, is:
type=EXECVE msg=audit(1513345148.627:875): argc=11 a0="runc" a1="--root" a2="/run/containerd/runc/default" a3="--log" a4="/run/containerd/io.containerd.runtime.v1.linux/default/kubernetes-docker-image-cache-common/log.json" a5="--log-format" a6="json" a7="kill" a8="--all" a9="kubernetes-docker-image-cache-common" a10="9"
which is:
runc --root /run/containerd/runc/default \
  --log /run/containerd/io.containerd.runtime.v1.linux/default/kubernetes-docker-image-cache-common/log.json \
  --log-format json \
  kill --all kubernetes-docker-image-cache-common 9
Its parent is the containerd-shim for the common cache service.
So it does appear that tearing down the common cache service (which is the first to exit) is tearing down all the processes in that cgroup.
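That matches what the cgroup itself shows: everything sharing the cgroupsPath from #14 is listed in the same cgroup.procs, so a kill --all aimed at any one container takes out all of the others too (a sketch; substitute the actual shared path):

# All of the docker, image-cache, etc. PIDs show up together here.
cat /sys/fs/cgroup/pids/<shared-path>/cgroup.procs
# Compare with the PIDs reported by `ctr t ls` above.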
@justincormack is it notable that 662d3d4 sets cgroupsPath on things but doesn't use runtime.cgroups on the kubelet to ensure they exist? (I thought you said something about needing that for some reason.) Not sure it relates to this issue, but I noticed it while investigating.
runc kill --all ends up calling into a function signalAllProcesses, which has a comment:
// signalAllProcesses freezes then iterates over all the processes inside the
// manager's cgroups sending the signal s to them.
// If s is SIGKILL then it will wait for each process to exit.
// For all other signals it will check if the process is ready to report its
// exit status and only if it is will a wait be performed.
So it seems to be expected that this will nuke the whole cgroup.
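In cgroup-v1 terms it does roughly the equivalent of the following (just an illustration of the mechanism, not the actual Go code):

# Freeze the cgroup so nothing can fork and escape, SIGKILL every member,
# then thaw so the pending kills are delivered.
echo FROZEN > /sys/fs/cgroup/freezer/<shared-path>/freezer.state
for pid in $(cat /sys/fs/cgroup/freezer/<shared-path>/cgroup.procs); do
    kill -9 "$pid"
done
echo THAWED > /sys/fs/cgroup/freezer/<shared-path>/freezer.state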
containerd/containerd#1932 (comment) says:
Its "supported", as in it will place them in the same one, but there are side effects. Just don't do it. The default cgroups config for containerd is to place each container in its own cgroup, under a namespace. There is no reason to put containers in the same cgroup and many more things at the runc level expect separate cgroups as well as most other container systems.
So I guess #14 needs to be redone using sub-cgroups under a namespace. @justincormack will you look into that?