container leak
jsievers opened this issue · 11 comments
Description
For some weeks now we have been seeing Concourse containers not being reaped.
Concourse 2.1.0, garden-runc/0.8.0, bosh-openstack-kvm-ubuntu-trusty-go_agent/3263.3
The symptom is "insufficient subnets remaining in the pool", similar to concourse/concourse#293.
It takes several days for the leak to reach the state where workers run out of containers (250 containers per worker)
The effect can be seen when comparing `fly containers` with `fly workers`: `fly workers` shows many more containers than `fly containers`.
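For reference, a rough way to compare the two (hypothetical fly target name `ci`; assumes each command prints a single header row):

```bash
# containers the ATC still knows about
fly -t ci containers | tail -n +2 | wc -l

# per-worker container counts as reported by the workers
fly -t ci workers
```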
When you `bosh ssh` into the worker and use `gaol list`, you can see the "zombie" containers. Trying to `gaol shell` into a "zombie" container gives an error:
root@7402fe78-8b6d-4c6e-97fb-e809337980ea:~# /tmp/gaol shell fdd75cce-c0ae-46ee-729b-6f9374525ab9
error: hijack: Backend error: Exit status: 500, message: {"Type":"","Message":"unable to find user root: no matching entries in passwd file","Handle":""}
`gaol properties` of a "dead" container look like this:
garden.grace-time 300000000000
garden.network.host-ip 10.254.0.93
kawasaki.container-interface wui37gmn4uj7-1
kawasaki.bridge-interface wbrdg-0afe005c
kawasaki.dns-servers
kawasaki.mtu 1500
concourse:volume-mounts {"2b3475a7-1e37-42c5-7a4b-e49e68074495":"/tmp/build/get"}
garden.network.container-ip 10.254.0.94
garden.network.external-ip 10.1.6.17
garden.state created
concourse:resource-result {
"version": {
"digest": "sha256:4e9752e8f15bff07872664e7e206678e45dd84bccce02383f9b29a2bd1501864"
},
"metadata": [
{
"name": "image",
"value": "sha256:4c07d"
}
]
}
kawasaki.host-interface wui37gmn4uj7-0
kawasaki.iptable-prefix w--
So it looks like all zombie containers are `concourse:resource-result` containers, i.e. containers which should have done a git clone.
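A quick sketch to confirm that pattern across all live handles (assuming `gaol list` prints one handle per line and `gaol properties` prints the key/value pairs shown above):

```bash
for handle in $(/tmp/gaol list); do
  # flag handles that carry a concourse:resource-result property, i.e. resource (get) containers
  if /tmp/gaol properties "$handle" | grep -q 'concourse:resource-result'; then
    echo "$handle looks like a leaked resource container"
  fi
done
```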
Logging and/or test output
The last log entries of a "dead" container:
{"timestamp":"1475624427.399276257","source":"guardian","message":"guardian.run.exec.finished","log_level":1,"data":{"handle":"a9645f41-a32f-4fa7-58e4-98838a3191d8","id":"a9645f41-a32f-4fa7-58e4-98838a3191d8","path":"/opt/resource/in","session":"6796.2"}}
{"timestamp":"1475624427.399291754","source":"guardian","message":"guardian.run.finished","log_level":1,"data":{"handle":"a9645f41-a32f-4fa7-58e4-98838a3191d8","path":"/opt/resource/in","session":"6796"}}
{"timestamp":"1475624427.399307251","source":"guardian","message":"guardian.api.garden-server.run.spawned","log_level":1,"data":{"handle":"a9645f41-a32f-4fa7-58e4-98838a3191d8","id":"337a97b6-3343-48a4-6d62-37b032f7729b","session":"3.1.52629","spec":{"Path":"/opt/resource/in","Dir":"","User":"root","Limits":{},"TTY":null}}}
{"timestamp":"1475624637.809005260","source":"guardian","message":"guardian.api.garden-server.run.exited","log_level":1,"data":{"handle":"a9645f41-a32f-4fa7-58e4-98838a3191d8","id":"337a97b6-3343-48a4-6d62-37b032f7729b","session":"3.1.52629","status":0}}
In comparison, here are the log entries of a container that was cleaned up (the same steps are executed, except that for the zombie container the reaper did not kick in):
{"timestamp":"1475653406.035202265","source":"guardian","message":"guardian.run.exec.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","id":"96692456-981b-454e-5180-7d841ca488a8","path":"landscape/documentation/concourse/tasks/smoke/test.sh","session":"9625.2"}}
{"timestamp":"1475653406.035219669","source":"guardian","message":"guardian.run.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","path":"landscape/documentation/concourse/tasks/smoke/test.sh","session":"9625"}}
{"timestamp":"1475653406.035238743","source":"guardian","message":"guardian.api.garden-server.run.spawned","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","id":"e25b4989-2ac0-44cf-6334-f5e893d60068","session":"3.1.58572","spec":{"Path":"landscape/documentation/concourse/tasks/smoke/test.sh","Dir
":"/tmp/build/f541ec31","User":"root","Limits":{},"TTY":{}}}}
{"timestamp":"1475653432.606876373","source":"guardian","message":"guardian.api.garden-server.run.exited","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","id":"e25b4989-2ac0-44cf-6334-f5e893d60068","session":"3.1.58572","status":0}}
{"timestamp":"1475653929.393552065","source":"guardian","message":"guardian.api.garden-server.reaping","log_level":1,"data":{"grace-time":"5m0s","handle":"96692456-981b-454e-5180-7d841ca488a8","session":"3.1"}}
{"timestamp":"1475653929.393662930","source":"guardian","message":"guardian.destroy.start","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9724"}}
{"timestamp":"1475653929.394035816","source":"guardian","message":"guardian.destroy.started","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725"}}
{"timestamp":"1475653929.394057512","source":"guardian","message":"guardian.destroy.state.started","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.1"}}
{"timestamp":"1475653929.404460430","source":"guardian","message":"guardian.destroy.state.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.1"}}
{"timestamp":"1475653929.404501915","source":"guardian","message":"guardian.destroy.state","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725","state":{"Pid":28253,"Status":"created"}}}
{"timestamp":"1475653929.404526949","source":"guardian","message":"guardian.destroy.delete.started","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.2"}}
{"timestamp":"1475653929.515482426","source":"guardian","message":"guardian.destroy.delete.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.2"}}
{"timestamp":"1475653929.515547752","source":"guardian","message":"guardian.destroy.destroy.started","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.3"}}
{"timestamp":"1475653929.517130852","source":"guardian","message":"guardian.destroy.destroy.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.3"}}
{"timestamp":"1475653929.517160177","source":"guardian","message":"guardian.destroy.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725"}}
{"timestamp":"1475653929.535522699","source":"guardian","message":"guardian.create.containerizer-create.watch.done","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9623.1.4"}}
{"timestamp":"1475653929.726925850","source":"guardian","message":"guardian.volume-plugin.destroying.layer-already-deleted-skipping","log_level":1,"data":{"error":"could not find image: no such id: 76051528cd636f7704dea82da6cdd438a8e12de7cf99008ae715724c265cf0d2","graphID":"96692456-981b-454e-5180-7d841ca488a8","handle":"96692456-981b-454e-5180-7d841ca488a8","id":"96692456-981b-454e-5180-7d841ca488a8","session":"9728"}}
{"timestamp":"1475653929.727013826","source":"guardian","message":"guardian.destroy.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9724"}}
Looks like the question is why the guardian destroy (https://github.com/cloudfoundry/garden/blob/c7ed40f0b983c8d082dcdfc3dcd5adfa1020195f/server/request_handling.go#L128) does not kick in for containers that no longer run any processes.
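As a possible stopgap, leaked handles could be deleted by hand so their subnets go back into the pool; a sketch, assuming `gaol destroy` accepts a handle as an argument:

```bash
# destroy one leaked handle manually (example handle from the gaol shell attempt above)
/tmp/gaol destroy fdd75cce-c0ae-46ee-729b-6f9374525ab9
```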
Steps to reproduce
The problem is hard to reproduce and the leak is slow (~2 containers per hour are not reaped).
- Guardian release version garden-runc/0.8.0
- Linux kernel version bosh-openstack-kvm-ubuntu-trusty-go_agent/3263.3
- Concourse version 2.1.0
- Go version 1.6.1
Hi there!
We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created.
The current status is as follows:
- #131978773 container leak
This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.
Maybe a duplicate of cloudfoundry/garden-runc-release#18.
Garden doesn't automatically destroy containers when the last process exits; that's up to the client (Concourse in this case), which is how you are able to `fly` into the failed containers for some time after they exit. If the container appears in `fly workers` but not in `fly containers`, that implies Concourse has not destroyed them, which seems like a Concourse bug (you can clearly see from the logs that we never received a destroy call, so of course we didn't destroy anything).
/cc @vito: could Concourse maybe have lost track of some containers it should have destroyed?
Just to clarify our experience (#18 closed above, thanks 👍): I'm pretty sure that what we see in `fly workers` matches what we see in `fly containers`. We think our situation came about due to the upgrade process (from Concourse 1.6 / runC 0.4 to Concourse 2.2.1 / runC 0.8).
@julz As of today Concourse never calls `Destroy`; it relies on heartbeating and `GraceTime` to let Garden destroy the containers itself. So normally once we stop using a container it'll go away eventually.
In upcoming versions we'll switch to explicit calls to `Destroy`, which will make these errors much easier to notice, but I'm not convinced that it's a Concourse bug at the moment. If the container is gone from `fly containers` but `fly workers` still reports it, that means Concourse stopped caring about it and it expired, but Garden didn't hold up its end of the bargain. (The containers in the DB follow the same heartbeating rules as the real containers.)
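For context, the `garden.grace-time` property shown earlier is in nanoseconds, which matches the `"grace-time":"5m0s"` in the reaping log line; a tiny sanity check:

```bash
# 300000000000 ns -> seconds -> minutes
echo "$(( 300000000000 / 1000000000 / 60 )) minutes"   # prints "5 minutes"
```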
In troubleshooting further issues with "insufficient subnets remaining in the pool", as well as "fork: Resource temporarily unavailable" (EAGAIN), we've observed a case where our BOSH stemcell VM doesn't mount cgroups. This is a problem for running containers with runc, as Garden-runc does.
We found that the runc project has a `check-config.sh` script that runs various checks to make sure the system is able to run containers. Our VMs don't pass this check. Full output is appended below, but the key message is:
Generally Necessary:
- cgroup hierarchy: nonexistent??
(see https://github.com/tianon/cgroupfs-mount)
Following that link, it seems we could `apt-get install cgroup-lite` (on Ubuntu Trusty) to have the cgroup filesystems mounted.
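For reference, what cgroup-lite / the cgroupfs-mount script roughly do is sketched below (not a recommendation, just to show what would change on the VM):

```bash
# mount a tmpfs at /sys/fs/cgroup, then one cgroup hierarchy per enabled subsystem
mountpoint -q /sys/fs/cgroup || mount -t tmpfs -o uid=0,gid=0,mode=0755 cgroup /sys/fs/cgroup
for sys in $(awk '!/^#/ { if ($4 == 1) print $1 }' /proc/cgroups); do
  mkdir -p "/sys/fs/cgroup/$sys"
  mountpoint -q "/sys/fs/cgroup/$sys" \
    || mount -n -t cgroup -o "$sys" cgroup "/sys/fs/cgroup/$sys"
done
```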
We are currently using fairly old stemcells:
bosh-aws-xen-hvm-ubuntu-trusty-go_agent | ubuntu-trusty | 3262.2*
- We can certainly try updating the stemcell, or just install cgroup-lite somehow.
- To what extent does Garden check the capabilities of the machine it's running on?
- Are we totally off-base with this line of exploration towards resolving the "subnets" and "fork: EAGAIN" issues? We didn't see much of anyone else going this far.
- d#, @cjcjameson, cc to @ryantang
Appendix:
# /var/vcap/packages/runc/src/github.com/opencontainers/runc/script/check-config.sh
warning: /proc/config.gz does not exist, searching other paths for kernel config ...
info: reading kernel config from /boot/config-3.19.0-64-generic ...
Generally Necessary:
- cgroup hierarchy: nonexistent??
(see https://github.com/tianon/cgroupfs-mount)
- apparmor: enabled and tools installed
- CONFIG_NAMESPACES: enabled
- CONFIG_NET_NS: enabled
- CONFIG_PID_NS: enabled
- CONFIG_IPC_NS: enabled
- CONFIG_UTS_NS: enabled
- CONFIG_CGROUPS: enabled
- CONFIG_CGROUP_CPUACCT: enabled
- CONFIG_CGROUP_DEVICE: enabled
- CONFIG_CGROUP_FREEZER: enabled
- CONFIG_CGROUP_SCHED: enabled
- CONFIG_CPUSETS: enabled
- CONFIG_MEMCG: enabled
- CONFIG_KEYS: enabled
- CONFIG_MACVLAN: enabled (as module)
- CONFIG_VETH: enabled (as module)
- CONFIG_BRIDGE: enabled (as module)
- CONFIG_BRIDGE_NETFILTER: enabled (as module)
- CONFIG_NF_NAT_IPV4: enabled (as module)
- CONFIG_IP_NF_FILTER: enabled (as module)
- CONFIG_IP_NF_TARGET_MASQUERADE: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_CONNTRACK: enabled (as module)
- CONFIG_NF_NAT: enabled (as module)
- CONFIG_NF_NAT_NEEDED: enabled
- CONFIG_POSIX_MQUEUE: enabled
Optional Features:
- CONFIG_USER_NS: enabled
- CONFIG_SECCOMP: enabled
- CONFIG_CGROUP_PIDS: missing
- CONFIG_MEMCG_SWAP: enabled
- CONFIG_MEMCG_SWAP_ENABLED: missing
(note that cgroup swap accounting is not enabled in your kernel config, you can enable it by setting boot option "swapaccount=1")
- CONFIG_MEMCG_KMEM: enabled
- CONFIG_BLK_CGROUP: enabled
- CONFIG_BLK_DEV_THROTTLING: enabled
- CONFIG_IOSCHED_CFQ: enabled
- CONFIG_CFQ_GROUP_IOSCHED: enabled
- CONFIG_CGROUP_PERF: enabled
- CONFIG_CGROUP_HUGETLB: enabled
- CONFIG_NET_CLS_CGROUP: enabled (as module)
- CONFIG_CGROUP_NET_PRIO: enabled
- CONFIG_CFS_BANDWIDTH: enabled
- CONFIG_FAIR_GROUP_SCHED: enabled
- CONFIG_RT_GROUP_SCHED: missing
In the meantime we upgraded Concourse and garden-runc. Using
Concourse 2.2.1, garden-runc/0.9.0, bosh-openstack-kvm-ubuntu-trusty-go_agent/3263.3
we can no longer reproduce the problem. `fly containers` and `fly workers` have yielded similar (low) numbers for a week now on the deployment which showed the leak before.
Judging from the release notes, we suppose that upgrading garden-runc to 0.9.0 fixed it:
Ensure deletes are atomic: even if garden is killed during deletes, the delete can now be completed on restart
This can be closed as far as we're concerned (unless you want to keep it open for the other scenarios also reported here).
OK, I'll close this since it sounds like upgrading solved it. Regarding the cgroup thing, Garden sets that all up for you on startup, so that should be fine. I can't quite figure out why 0.9.0 would fix this unless we were being SIGKILLed somehow before (that's the case the change above fixed), but let's keep an eye out and feel free to re-open if it does occur again!
We upgraded our stemcell to 3263.7 today (it has runc 1.0.0-rc1), but `check-config.sh` still does not pass (one more check passes, though: CONFIG_CGROUP_PIDS). There are no cgroup hierarchy filesystems mounted, so it doesn't seem like Garden is actually setting this up. We do see a tmpfs mounted at /sys/fs/cgroup, and it has some directories named after cgroup subsystems, but no cgroup filesystems are mounted under it.
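For completeness, this is roughly what we were looking at from the host's default mount namespace:

```bash
grep cgroup /proc/self/mounts   # only the tmpfs at /sys/fs/cgroup shows up here
ls /sys/fs/cgroup               # subsystem-named directories, but nothing mounted on them
```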
Would it be better if I open a new issue to continue this discussion?
Hey @dsharp-pivotal,
It sounds like you may be running `check-config.sh` from the "wrong" mount namespace.
Guardian actually runs in a separate mount namespace from the default/host namespace. This is achieved via a binary called `the-secret-garden` (which you might have seen in the process list).
You can enter the "correct" (aka Guardian's) mount namespace as follows:
/var/vcap/packages/guardian/bin/inspector-garden -pid $(pidof guardian) /bin/bash
Now if you run `cat /proc/self/mounts`, you should be able to see the actual cgroup mounts.
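If `inspector-garden` isn't handy, `nsenter` from util-linux should get you into the same mount namespace (though it may be missing on older Trusty stemcells):

```bash
nsenter --mount --target "$(pidof guardian)" /bin/bash -c 'grep cgroup /proc/self/mounts'
```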
@teddyking Aha, thank you for clearing that up for me. I see check-config.sh passing in that container too. We'll continue to monitor our systems, but I don't think we've seen any further issues since upgrading.