Docker daemon dies on boot when cgroupsPath used
ijc opened this issue · 13 comments
After #14 Docker does not start. Shortly after boot, without taking any other action, the state becomes:
(ns: getty) linuxkit-d6247f201b73:~# ctr t ls
TASK PID STATUS
rngd 808 STOPPED
sshd 851 RUNNING
docker 547 STOPPED
getty 586 RUNNING
kubelet 644 RUNNING
kubernetes-docker-image-cache-common 678 STOPPED
kubernetes-docker-image-cache-control-plane 730 STOPPED
ntpd 771 RUNNING
The logs are uninteresting:
(ns: getty) linuxkit-d6247f201b73:~# cat /var/log/docker.err.log
[WARN tini (547)] Tini is not running as PID 1 and isn't registered as a child subreaper.
Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
time="2017-12-08T14:55:34.404476201Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
time="2017-12-08T14:55:34.405395092Z" level=info msg="libcontainerd: new containerd process, pid: 591"
time="2017-12-08T14:55:36.515729941Z" level=info msg="[graphdriver] using prior storage driver: overlay2"
time="2017-12-08T14:55:36.557961200Z" level=info msg="Graph migration to content-addressability took 0.00 seconds"
time="2017-12-08T14:55:36.558552846Z" level=info msg="Loading containers: start."
time="2017-12-08T14:55:36.717274716Z" level=warning msg="Running modprobe nf_nat failed with message: `ip: can't find device 'nf_nat'\nmodprobe: module nf_nat not found in modules.dep`, error: exit status 1"
time="2017-12-08T14:55:36.731672396Z" level=warning msg="Running modprobe xt_conntrack failed with message: `ip: can't find device 'xt_conntrack'\nmodprobe: module xt_conntrack not found in modules.dep`, error: exit status 1"
time="2017-12-08T14:55:37.065878091Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
time="2017-12-08T14:55:37.171766223Z" level=info msg="Loading containers: done."
time="2017-12-08T14:55:37.274945197Z" level=info msg="Docker daemon" commit=f4ffd25 graphdriver(s)=overlay2 version=17.10.0-ce
time="2017-12-08T14:55:37.275544195Z" level=info msg="Daemon has completed initialization"
time="2017-12-08T14:55:37.288354509Z" level=info msg="API listen on /var/run/docker.sock"
It seems that reverting #14 fixes things. I'll double check and raise a PR to revert while we sort this out.
/cc @justincormack. This was also mentioned in #11 (comment).
#14 says the change was based on https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources but I don't see any reference to podruntime or systemreserved in the linked document. It doesn't seem to go into that sort of detail; it's about a kubelet flag that we don't use.
Looks like removing the cgroupsPath from the docker and kubelet services, while leaving the others in place, is sufficient to fix the problem.
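A rough way to confirm where each service's processes end up after editing the yml (a sketch, assuming the ctr task listing format shown above) is to compare their /proc/<pid>/cgroup entries:

# For each service of interest, look up its PID from ctr and print its cgroups.
for svc in docker kubelet kubernetes-docker-image-cache-common; do
  pid=$(ctr t ls | awk -v s="$svc" '$1 == s { print $2 }')
  echo "== $svc (pid $pid) =="
  cat "/proc/$pid/cgroup"
done

Services with no cgroupsPath set should show up under their own per-container path rather than the shared one.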
I had thought this was correlated with kubeadm being run, but actually it is something to do with the image cache loading: if I don't run kubeadm it still dies a little while after boot. However, if I remove the image caches from the image then it seems kubeadm can be run and things come up OK.
I added -D --log-level=debug to the daemon command line, producing docker.txt. I don't see anything interesting in there.
It's a bit hard to be sure, but using strace it looks like the dockerd process is being sent a SIGKILL from somewhere -- I can't see from where, though.
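Roughly, something like this (a sketch, not necessarily the exact invocation used) only shows that dockerd died from the signal, not which process sent it:

# Attach to the running daemon and follow children; when it dies strace
# reports the fatal signal (e.g. "+++ killed by SIGKILL +++") but not the sender.
strace -f -p "$(pidof dockerd)" -e trace=signal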
Putting docker and kubernetes-docker-image-cache-{control-plane,common} into three separate cgroups seems to resolve the issue.
I wonder if, between them, they are blowing through some sort of cgroup memory limit when they all share the same cgroup, and triggering something like an OOM kill? I'm seeing no sign of that in dmesg, though.
That's really odd. The cgroups change just creates the groups; no limits are applied. Hmm.
I wondered whether there was some unhelpful default in the kernel (e.g. 50% of RAM), which seems a little unlikely, or whether something (kubelet itself, maybe?) was taking it upon itself to apply caps to existing cgroups. Perhaps we need to pass kubelet one or more of its --cgroup-* options, which can reserve resources for system cgroups, to tell it which cgroup is which?
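One way to rule out an accidental limit (a sketch; the exact path depends on the cgroupsPath value used in #14) would be to read the memory controller's files for the shared group directly:

# Substitute the shared cgroupsPath from the yml for <shared-path>.
cat /sys/fs/cgroup/memory/<shared-path>/memory.limit_in_bytes     # applied limit (a huge number means unlimited)
cat /sys/fs/cgroup/memory/<shared-path>/memory.max_usage_in_bytes
dmesg | grep -iE 'oom|memory cgroup'                              # any OOM-killer activity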
It can't be kubelet, because this happens even when that service is still sitting in our kubelet.sh wrapper waiting for /etc/kubernetes/kubelet.conf to be populated by kubeadm.
I'm playing with auditd, trying to get a handle on which process is doing the killing.
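Something along these lines should catch the sender (a sketch; the key names are arbitrary): log every kill(2) that delivers SIGKILL plus every execve, then work backwards from the audit records after docker dies.

# Record every kill(2) delivering signal 9, and every execve, so the killer's
# PID can be matched back to the command that ran it.
auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k sigkill-hunt
auditctl -a always,exit -F arch=b64 -S execve -k exec-hunt

# After dockerd has died:
ausearch -k sigkill-hunt
ausearch -k exec-hunt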
I'm not 100% sure yet, but it looks like it is a runc process that is doing it (I'm now trying to trace fork+exec and friends to reconstruct the process tree). I wonder if, when the image cache services exit, it is killing everything in the same cgroup?
The process which is killing docker-init, dockerd and docker-containerd, as well as all the subprocesses of the control-plane image cache service, is:
type=EXECVE msg=audit(1513345148.627:875): argc=11 a0="runc" a1="--root" a2="/run/containerd/runc/default" a3="--log" a4="/run/containerd/io.containerd.runtime.v1.linux/default/kubernetes-docker-image-cache-common/log.json" a5="--log-format" a6="json" a7="kill" a8="--all" a9="kubernetes-docker-image-cache-common" a10="9"
which is:
runc --root /run/containerd/runc/default \
  --log /run/containerd/io.containerd.runtime.v1.linux/default/kubernetes-docker-image-cache-common/log.json \
  --log-format json \
  kill --all kubernetes-docker-image-cache-common 9
Its parent is the containerd-shim for the common cache service.
So it does appear that tearing down the common cache service (which is the first to exit) is tearing down all the processes in that cgroup.
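That matches what the cgroup itself shows: everything sharing the cgroupsPath from #14 is listed in the same cgroup.procs, so a kill --all aimed at any one container takes out all of the others too (a sketch; substitute the actual shared path):

# All of the docker, image-cache, etc. PIDs show up together here.
cat /sys/fs/cgroup/pids/<shared-path>/cgroup.procs
# Compare with the PIDs reported by `ctr t ls` above.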
@justincormack is it notable that 662d3d4 sets cgroupsPath on things but doesn't use runtime.cgroups on the kubelet to ensure they exist? (I thought you said something about needing that for some reason.) Not sure it relates to this issue, but I noticed it while investigating.
runc kill --all ends up calling into a function signalAllProcesses, which has a comment:
// signalAllProcesses freezes then iterates over all the processes inside the
// manager's cgroups sending the signal s to them.
// If s is SIGKILL then it will wait for each process to exit.
// For all other signals it will check if the process is ready to report its
// exit status and only if it is will a wait be performed.
So it seems to be expected that this will nuke the whole cgroup.
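In cgroup-v1 terms it does roughly the equivalent of the following (just an illustration of the mechanism, not the actual Go code):

# Freeze the cgroup so nothing can fork and escape, SIGKILL every member,
# then thaw so the pending kills are delivered.
echo FROZEN > /sys/fs/cgroup/freezer/<shared-path>/freezer.state
for pid in $(cat /sys/fs/cgroup/freezer/<shared-path>/cgroup.procs); do
    kill -9 "$pid"
done
echo THAWED > /sys/fs/cgroup/freezer/<shared-path>/freezer.state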
containerd/containerd#1932 (comment) says:
Its "supported", as in it will place them in the same one, but there are side effects. Just don't do it. The default cgroups config for containerd is to place each container in its own cgroup, under a namespace. There is no reason to put containers in the same cgroup and many more things at the runc level expect separate cgroups as well as most other container systems.
So I guess #14 needs to be redone using sub-cgroups under a namespace. @justincormack will you look into that?