container leak
jsievers opened this issue · 11 comments
Description
For some weeks now we have been seeing Concourse containers not being reaped.
Concourse 2.1.0, garden-runc/0.8.0, bosh-openstack-kvm-ubuntu-trusty-go_agent/3263.3
The symptom is "insufficient subnets remaining in the pool", similar to concourse/concourse#293.
It takes several days for the leak to reach the state where workers run out of containers (250 containers per worker)
The effect can be seen when comparing `fly containers` with `fly workers`: `fly workers` shows many more containers than `fly containers`.
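For reference, a rough way to compare the two (hypothetical fly target name `ci`; assumes each command prints a single header row):

```bash
# containers the ATC still knows about
fly -t ci containers | tail -n +2 | wc -l

# per-worker container counts as reported by the workers
fly -t ci workers
```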
When you `bosh ssh` into the worker and use `gaol list`, you can see the "zombie" containers. Trying to `gaol shell` into a "zombie" container gives an error:
root@7402fe78-8b6d-4c6e-97fb-e809337980ea:~# /tmp/gaol shell fdd75cce-c0ae-46ee-729b-6f9374525ab9
error: hijack: Backend error: Exit status: 500, message: {"Type":"","Message":"unable to find user root: no matching entries in passwd file","Handle":""}
`gaol properties` of a "dead" container look like this:
garden.grace-time 300000000000
garden.network.host-ip 10.254.0.93
kawasaki.container-interface wui37gmn4uj7-1
kawasaki.bridge-interface wbrdg-0afe005c
kawasaki.dns-servers
kawasaki.mtu 1500
concourse:volume-mounts {"2b3475a7-1e37-42c5-7a4b-e49e68074495":"/tmp/build/get"}
garden.network.container-ip 10.254.0.94
garden.network.external-ip 10.1.6.17
garden.state created
concourse:resource-result {
"version": {
"digest": "sha256:4e9752e8f15bff07872664e7e206678e45dd84bccce02383f9b29a2bd1501864"
},
"metadata": [
{
"name": "image",
"value": "sha256:4c07d"
}
]
}
kawasaki.host-interface wui37gmn4uj7-0
kawasaki.iptable-prefix w--
So it looks like all zombie containers are `concourse:resource-result` containers, i.e. containers which should have done a git clone.
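A quick sketch to confirm that pattern across all live handles (assuming `gaol list` prints one handle per line and `gaol properties` prints the key/value pairs shown above):

```bash
for handle in $(/tmp/gaol list); do
  # flag handles that carry a concourse:resource-result property, i.e. resource (get) containers
  if /tmp/gaol properties "$handle" | grep -q 'concourse:resource-result'; then
    echo "$handle looks like a leaked resource container"
  fi
done
```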
Logging and/or test output
The last log entries of a "dead" container:
{"timestamp":"1475624427.399276257","source":"guardian","message":"guardian.run.exec.finished","log_level":1,"data":{"handle":"a9645f41-a32f-4fa7-58e4-98838a3191d8","id":"a9645f41-a32f-4fa7-58e4-98838a3191d8","path":"/opt/resource/in","session":"6796.2"}}
{"timestamp":"1475624427.399291754","source":"guardian","message":"guardian.run.finished","log_level":1,"data":{"handle":"a9645f41-a32f-4fa7-58e4-98838a3191d8","path":"/opt/resource/in","session":"6796"}}
{"timestamp":"1475624427.399307251","source":"guardian","message":"guardian.api.garden-server.run.spawned","log_level":1,"data":{"handle":"a9645f41-a32f-4fa7-58e4-98838a3191d8","id":"337a97b6-3343-48a4-6d62-37b032f7729b","session":"3.1.52629","spec":{"Path":"/opt/resource/in","Dir":"","User":"root","Limits":{},"TTY":null}}}
{"timestamp":"1475624637.809005260","source":"guardian","message":"guardian.api.garden-server.run.exited","log_level":1,"data":{"handle":"a9645f41-a32f-4fa7-58e4-98838a3191d8","id":"337a97b6-3343-48a4-6d62-37b032f7729b","session":"3.1.52629","status":0}}
In comparison, here are the log entries of a container that was cleaned up (the same steps are executed, except that for the zombie container the reaper did not kick in):
{"timestamp":"1475653406.035202265","source":"guardian","message":"guardian.run.exec.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","id":"96692456-981b-454e-5180-7d841ca488a8","path":"landscape/documentation/concourse/tasks/smoke/test.sh","session":"9625.2"}}
{"timestamp":"1475653406.035219669","source":"guardian","message":"guardian.run.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","path":"landscape/documentation/concourse/tasks/smoke/test.sh","session":"9625"}}
{"timestamp":"1475653406.035238743","source":"guardian","message":"guardian.api.garden-server.run.spawned","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","id":"e25b4989-2ac0-44cf-6334-f5e893d60068","session":"3.1.58572","spec":{"Path":"landscape/documentation/concourse/tasks/smoke/test.sh","Dir
":"/tmp/build/f541ec31","User":"root","Limits":{},"TTY":{}}}}
{"timestamp":"1475653432.606876373","source":"guardian","message":"guardian.api.garden-server.run.exited","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","id":"e25b4989-2ac0-44cf-6334-f5e893d60068","session":"3.1.58572","status":0}}
{"timestamp":"1475653929.393552065","source":"guardian","message":"guardian.api.garden-server.reaping","log_level":1,"data":{"grace-time":"5m0s","handle":"96692456-981b-454e-5180-7d841ca488a8","session":"3.1"}}
{"timestamp":"1475653929.393662930","source":"guardian","message":"guardian.destroy.start","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9724"}}
{"timestamp":"1475653929.394035816","source":"guardian","message":"guardian.destroy.started","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725"}}
{"timestamp":"1475653929.394057512","source":"guardian","message":"guardian.destroy.state.started","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.1"}}
{"timestamp":"1475653929.404460430","source":"guardian","message":"guardian.destroy.state.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.1"}}
{"timestamp":"1475653929.404501915","source":"guardian","message":"guardian.destroy.state","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725","state":{"Pid":28253,"Status":"created"}}}
{"timestamp":"1475653929.404526949","source":"guardian","message":"guardian.destroy.delete.started","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.2"}}
{"timestamp":"1475653929.515482426","source":"guardian","message":"guardian.destroy.delete.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.2"}}
{"timestamp":"1475653929.515547752","source":"guardian","message":"guardian.destroy.destroy.started","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.3"}}
{"timestamp":"1475653929.517130852","source":"guardian","message":"guardian.destroy.destroy.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725.3"}}
{"timestamp":"1475653929.517160177","source":"guardian","message":"guardian.destroy.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9725"}}
{"timestamp":"1475653929.535522699","source":"guardian","message":"guardian.create.containerizer-create.watch.done","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9623.1.4"}}
{"timestamp":"1475653929.726925850","source":"guardian","message":"guardian.volume-plugin.destroying.layer-already-deleted-skipping","log_level":1,"data":{"error":"could not find image: no such id: 76051528cd636f7704dea82da6cdd438a8e12de7cf99008ae715724c265cf0d2","graphID":"96692456-981b-454e-5180-7d841ca488a8","handle":"96692456-981b-454e-5180-7d841ca488a8","id":"96692456-981b-454e-5180-7d841ca488a8","session":"9728"}}
{"timestamp":"1475653929.727013826","source":"guardian","message":"guardian.destroy.finished","log_level":1,"data":{"handle":"96692456-981b-454e-5180-7d841ca488a8","session":"9724"}}
Looks like the question is why the guardian destroy (https://github.com/cloudfoundry/garden/blob/c7ed40f0b983c8d082dcdfc3dcd5adfa1020195f/server/request_handling.go#L128) does not kick in for containers that no longer run any processes.
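As a possible stopgap, leaked handles could be deleted by hand so their subnets go back into the pool; a sketch, assuming `gaol destroy` accepts a handle as an argument:

```bash
# destroy one leaked handle manually (example handle from the gaol shell attempt above)
/tmp/gaol destroy fdd75cce-c0ae-46ee-729b-6f9374525ab9
```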
Steps to reproduce
The problem is hard to reproduce and the leak is slow (~2 containers per hour are not reaped).
- Guardian release version garden-runc/0.8.0
- Linux kernel version bosh-openstack-kvm-ubuntu-trusty-go_agent/3263.3
- Concourse version 2.1.0
- Go version 1.6.1
Hi there!
We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created.
The current status is as follows:
- #131978773 container leak
This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.
Maybe a duplicate of cloudfoundry/garden-runc-release#18.
Garden doesn't automatically destroy containers when the last process exits; that's up to the client (Concourse in this case), which is how you are able to `fly` into the failed containers for some time after they exit. If the container appears in `fly workers` but not in `fly containers`, that implies Concourse has not destroyed them, which seems like a Concourse bug (you can clearly see from the logs that we never received a destroy call, so of course we didn't destroy anything).
/cc @vito: could Concourse maybe have lost track of some containers it should have destroyed?
Just to clarify our experience (#18 closed above, thanks 👍): I'm pretty sure that what we see in `fly workers` matches what we see in `fly containers`. We think our situation came about due to the upgrade process (from Concourse 1.6 / runC 0.4 to Concourse 2.2.1 / runC 0.8).
@julz As of today Concourse never calls `Destroy`; it relies on heartbeating and `GraceTime` to let Garden destroy the containers itself. So normally once we stop using a container it'll go away eventually.
In upcoming versions we'll switch to explicit calls to `Destroy`, which will make these errors much easier to notice, but I'm not convinced that it's a Concourse bug at the moment. If the container is gone from `fly containers` but `fly workers` still reports it, that means Concourse stopped caring about it and it expired, but Garden didn't hold up its end of the bargain. (The containers in the DB follow the same heartbeating rules as the real containers.)
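For context, the `garden.grace-time` property shown earlier is in nanoseconds, which matches the `"grace-time":"5m0s"` in the reaping log line; a tiny sanity check:

```bash
# 300000000000 ns -> seconds -> minutes
echo "$(( 300000000000 / 1000000000 / 60 )) minutes"   # prints "5 minutes"
```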
In troubleshooting further issues with "insufficient subnets remaining in the pool", as well as "fork: Resource temporarily unavailable" (EAGAIN), we've observed a case where our BOSH stemcell VM doesn't mount cgroups. This is a problem for running containers with runc, as Garden-runc does.
We found that the runc project has a `check-config.sh` script that runs various checks to make sure the system is able to run containers. Our VMs don't pass this check. Full output is appended below, but the key message is:
Generally Necessary:
- cgroup hierarchy: nonexistent??
(see https://github.com/tianon/cgroupfs-mount)
Following that link, it seems we could `apt-get install cgroup-lite` (on Ubuntu Trusty) to have the cgroup filesystems mounted.
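For reference, what cgroup-lite / the cgroupfs-mount script roughly do is sketched below (not a recommendation, just to show what would change on the VM):

```bash
# mount a tmpfs at /sys/fs/cgroup, then one cgroup hierarchy per enabled subsystem
mountpoint -q /sys/fs/cgroup || mount -t tmpfs -o uid=0,gid=0,mode=0755 cgroup /sys/fs/cgroup
for sys in $(awk '!/^#/ { if ($4 == 1) print $1 }' /proc/cgroups); do
  mkdir -p "/sys/fs/cgroup/$sys"
  mountpoint -q "/sys/fs/cgroup/$sys" \
    || mount -n -t cgroup -o "$sys" cgroup "/sys/fs/cgroup/$sys"
done
```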
We are currently using fairly old stemcells:
bosh-aws-xen-hvm-ubuntu-trusty-go_agent | ubuntu-trusty | 3262.2*
- We can certainly try updating the stemcell, or just install cgroup-lite somehow.
- To what extent does Garden check the capabilities of the machine it's running on?
- Are we totally off-base with this line of exploration towards resolving the "subnets" and "fork: EAGAIN" issues? We didn't see much of anyone else going this far.
- d#, @cjcjameson, cc to @ryantang
Appendix:
# /var/vcap/packages/runc/src/github.com/opencontainers/runc/script/check-config.sh
warning: /proc/config.gz does not exist, searching other paths for kernel config ...
info: reading kernel config from /boot/config-3.19.0-64-generic ...
Generally Necessary:
- cgroup hierarchy: nonexistent??
(see https://github.com/tianon/cgroupfs-mount)
- apparmor: enabled and tools installed
- CONFIG_NAMESPACES: enabled
- CONFIG_NET_NS: enabled
- CONFIG_PID_NS: enabled
- CONFIG_IPC_NS: enabled
- CONFIG_UTS_NS: enabled
- CONFIG_CGROUPS: enabled
- CONFIG_CGROUP_CPUACCT: enabled
- CONFIG_CGROUP_DEVICE: enabled
- CONFIG_CGROUP_FREEZER: enabled
- CONFIG_CGROUP_SCHED: enabled
- CONFIG_CPUSETS: enabled
- CONFIG_MEMCG: enabled
- CONFIG_KEYS: enabled
- CONFIG_MACVLAN: enabled (as module)
- CONFIG_VETH: enabled (as module)
- CONFIG_BRIDGE: enabled (as module)
- CONFIG_BRIDGE_NETFILTER: enabled (as module)
- CONFIG_NF_NAT_IPV4: enabled (as module)
- CONFIG_IP_NF_FILTER: enabled (as module)
- CONFIG_IP_NF_TARGET_MASQUERADE: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: enabled (as module)
- CONFIG_NETFILTER_XT_MATCH_CONNTRACK: enabled (as module)
- CONFIG_NF_NAT: enabled (as module)
- CONFIG_NF_NAT_NEEDED: enabled
- CONFIG_POSIX_MQUEUE: enabled
Optional Features:
- CONFIG_USER_NS: enabled
- CONFIG_SECCOMP: enabled
- CONFIG_CGROUP_PIDS: missing
- CONFIG_MEMCG_SWAP: enabled
- CONFIG_MEMCG_SWAP_ENABLED: missing
(note that cgroup swap accounting is not enabled in your kernel config, you can enable it by setting boot option "swapaccount=1")
- CONFIG_MEMCG_KMEM: enabled
- CONFIG_BLK_CGROUP: enabled
- CONFIG_BLK_DEV_THROTTLING: enabled
- CONFIG_IOSCHED_CFQ: enabled
- CONFIG_CFQ_GROUP_IOSCHED: enabled
- CONFIG_CGROUP_PERF: enabled
- CONFIG_CGROUP_HUGETLB: enabled
- CONFIG_NET_CLS_CGROUP: enabled (as module)
- CONFIG_CGROUP_NET_PRIO: enabled
- CONFIG_CFS_BANDWIDTH: enabled
- CONFIG_FAIR_GROUP_SCHED: enabled
- CONFIG_RT_GROUP_SCHED: missing
In the meantime we upgraded Concourse and garden-runc. Using
Concourse 2.2.1, garden-runc/0.9.0, bosh-openstack-kvm-ubuntu-trusty-go_agent/3263.3
we can no longer reproduce the problem. `fly containers` and `fly workers` have yielded similar (low) numbers for a week now on the deployment which showed the leak before.
Judging from the release notes, we suppose that upgrading garden-runc to 0.9.0 fixed it:
Ensure deletes are atomic: even if garden is killed during deletes, the delete can now be completed on restart
This can be closed as far as we're concerned (unless you want to keep it open for the other scenarios also reported here).
OK, I'll close this since it sounds like upgrading solved it. Regarding the cgroup thing, Garden sets that all up for you on startup, so that should be fine. I can't quite figure out why 0.9.0 would fix this unless we were being SIGKILLed somehow before (that's the case the change above fixed), but let's keep an eye out and feel free to re-open if it does occur again!
We upgraded our stemcell to 3263.7 today (it has runc 1.0.0-rc1), but `check-config.sh` still does not pass (one more check passes, though: CONFIG_CGROUP_PIDS). There are no cgroup hierarchy filesystems mounted, so it doesn't seem like Garden is actually setting this up. We do see a tmpfs mounted at /sys/fs/cgroup, and it has some directories named after cgroup subsystems, but no cgroup filesystems are mounted under it.
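For completeness, this is roughly what we were looking at from the host's default mount namespace:

```bash
grep cgroup /proc/self/mounts   # only the tmpfs at /sys/fs/cgroup shows up here
ls /sys/fs/cgroup               # subsystem-named directories, but nothing mounted on them
```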
Would it be better if I open a new issue to continue this discussion?
Hey @dsharp-pivotal,
It sounds like you may be running `check-config.sh` from the "wrong" mount namespace.
Guardian actually runs in a separate mount namespace from the default/host namespace. This is achieved via a binary called `the-secret-garden` (which you might have seen in the process list).
You can enter the "correct" (aka Guardian's) mount namespace as follows:
/var/vcap/packages/guardian/bin/inspector-garden -pid $(pidof guardian) /bin/bash
Now if you run `cat /proc/self/mounts`, you should be able to see the actual cgroup mounts.
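If `inspector-garden` isn't handy, `nsenter` from util-linux should get you into the same mount namespace (though it may be missing on older Trusty stemcells):

```bash
nsenter --mount --target "$(pidof guardian)" /bin/bash -c 'grep cgroup /proc/self/mounts'
```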
@teddyking Aha, thank you for clearing that up for me. I see check-config.sh passing in that container too. We'll continue to monitor our systems, but I don't think we've seen any further issues since upgrading.