rkt/rkt

Unable to GC some containers

jcollie opened this issue · 24 comments

Using CentOS 7 (kernel 3.10.0-327.3.1.el7.x86_64), rkt 0.14.0, I'm unable to GC some containers:

[root@svr05 ~]# rkt gc
Garbage collecting pod "42e78965-c60b-4f4f-b412-484cd381fe90"
Error getting stage1 treeStoreID: no such file or directory
Skipping stage1 GC
Unable to remove pod "42e78965-c60b-4f4f-b412-484cd381fe90": remove /var/lib/rkt/pods/exited-garbage/42e78965-c60b-4f4f-b412-484cd381fe90/stage1/rootfs: device or resource busy
[root@svr05 ~]# mount | fgrep 42e78965
[root@svr05 ~]# lsof +D /var/lib/rkt/pods/exited-garbage/42e78965-c60b-4f4f-b412-484cd381fe90
[root@svr05 ~]#  uname -a
Linux svr05.ocjtech.us 3.10.0-327.3.1.el7.x86_64 #1 SMP Wed Dec 9 14:09:15 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@svr05 ~]# rkt version
rkt version 0.14.0
appc version 0.7.4
[root@svr05 ~]#  rkt list
1 error(s) encountered when listing pods:
----------------------------------------
Unable to read pod 42e78965-c60b-4f4f-b412-484cd381fe90 manifest:
  no such file or directory
----------------------------------------
Misc:
  rkt's appc version: 0.7.4
----------------------------------------

UUID        APP     IMAGE NAME          STATE   NETWORKS
09914839    sabnzbd     ocjtech.us/sabnzbd:0.12     running 
bd8a5ffe    sonarr      ocjtech.us/sonarr:0.10      running 
e40e09e7    privateinternet ocjtech.us/pia:0.4      exited  
        transmission    ocjtech.us/transmission:0.2     
[root@svr05 ~]# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devtmpfs on /dev type devtmpfs (rw,nosuid,seclabel,size=1889732k,nr_inodes=472433,mode=755)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,nodev,seclabel,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,seclabel,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
configfs on /sys/kernel/config type configfs (rw,relatime)
/dev/mapper/centos-root on / type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
selinuxfs on /sys/fs/selinux type selinuxfs (rw,relatime)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=28,pgrp=1,timeout=300,minproto=5,maxproto=5,direct)
mqueue on /dev/mqueue type mqueue (rw,relatime,seclabel)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,seclabel)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
/dev/mapper/centos-home on /home type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/sda1 on /boot type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
192.168.4.101,192.168.4.102,192.168.4.103:/ on /mnt/ceph type ceph (rw,relatime,name=admin,secret=<hidden>)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,seclabel,size=380028k,mode=700)
overlay on /var/lib/rkt/pods/run/bd8a5ffe-0480-4196-aab4-5e8040d7cb6e/stage1/rootfs type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c30,c943",lowerdir=/var/lib/rkt/cas/tree/deps-sha512-78d2e3ef53a4963b3af0dcf533fb91874dc4642b7723e24fd5bf40e1f99ca9df/rootfs,upperdir=/var/lib/rkt/pods/run/bd8a5ffe-0480-4196-aab4-5e8040d7cb6e/overlay/deps-sha512-78d2e3ef53a4963b3af0dcf533fb91874dc4642b7723e24fd5bf40e1f99ca9df/upper,workdir=/var/lib/rkt/pods/run/bd8a5ffe-0480-4196-aab4-5e8040d7cb6e/overlay/deps-sha512-78d2e3ef53a4963b3af0dcf533fb91874dc4642b7723e24fd5bf40e1f99ca9df/work)
overlay on /var/lib/rkt/pods/run/bd8a5ffe-0480-4196-aab4-5e8040d7cb6e/stage1/rootfs/opt/stage2/sonarr/rootfs type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c30,c943",lowerdir=/var/lib/rkt/cas/tree/deps-sha512-b6214b00f1d2cdac36c5bd4b378b108539a9c8b739b450a1405d0020ce09571c/rootfs,upperdir=/var/lib/rkt/pods/run/bd8a5ffe-0480-4196-aab4-5e8040d7cb6e/overlay/deps-sha512-b6214b00f1d2cdac36c5bd4b378b108539a9c8b739b450a1405d0020ce09571c/upper/sonarr,workdir=/var/lib/rkt/pods/run/bd8a5ffe-0480-4196-aab4-5e8040d7cb6e/overlay/deps-sha512-b6214b00f1d2cdac36c5bd4b378b108539a9c8b739b450a1405d0020ce09571c/work/sonarr)
overlay on /var/lib/rkt/pods/run/09914839-3a18-46e4-afe7-e70810683b18/stage1/rootfs type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c149,c713",lowerdir=/var/lib/rkt/cas/tree/deps-sha512-78d2e3ef53a4963b3af0dcf533fb91874dc4642b7723e24fd5bf40e1f99ca9df/rootfs,upperdir=/var/lib/rkt/pods/run/09914839-3a18-46e4-afe7-e70810683b18/overlay/deps-sha512-78d2e3ef53a4963b3af0dcf533fb91874dc4642b7723e24fd5bf40e1f99ca9df/upper,workdir=/var/lib/rkt/pods/run/09914839-3a18-46e4-afe7-e70810683b18/overlay/deps-sha512-78d2e3ef53a4963b3af0dcf533fb91874dc4642b7723e24fd5bf40e1f99ca9df/work)
overlay on /var/lib/rkt/pods/run/09914839-3a18-46e4-afe7-e70810683b18/stage1/rootfs/opt/stage2/sabnzbd/rootfs type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c149,c713",lowerdir=/var/lib/rkt/cas/tree/deps-sha512-5a982fb08d2a7d8a774f2b85eecf0326482a0c3e9aafc18bacb64422a3cbe20f/rootfs,upperdir=/var/lib/rkt/pods/run/09914839-3a18-46e4-afe7-e70810683b18/overlay/deps-sha512-5a982fb08d2a7d8a774f2b85eecf0326482a0c3e9aafc18bacb64422a3cbe20f/upper/sabnzbd,workdir=/var/lib/rkt/pods/run/09914839-3a18-46e4-afe7-e70810683b18/overlay/deps-sha512-5a982fb08d2a7d8a774f2b85eecf0326482a0c3e9aafc18bacb64422a3cbe20f/work/sabnzbd)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
overlay on /var/lib/rkt/pods/exited-garbage/e40e09e7-ff47-47b7-a821-eb0d1f319524/stage1/rootfs type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c126,c297",lowerdir=/var/lib/rkt/cas/tree/deps-sha512-78d2e3ef53a4963b3af0dcf533fb91874dc4642b7723e24fd5bf40e1f99ca9df/rootfs,upperdir=/var/lib/rkt/pods/run/e40e09e7-ff47-47b7-a821-eb0d1f319524/overlay/deps-sha512-78d2e3ef53a4963b3af0dcf533fb91874dc4642b7723e24fd5bf40e1f99ca9df/upper,workdir=/var/lib/rkt/pods/run/e40e09e7-ff47-47b7-a821-eb0d1f319524/overlay/deps-sha512-78d2e3ef53a4963b3af0dcf533fb91874dc4642b7723e24fd5bf40e1f99ca9df/work)
overlay on /var/lib/rkt/pods/exited-garbage/e40e09e7-ff47-47b7-a821-eb0d1f319524/stage1/rootfs/opt/stage2/privateinternet/rootfs type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c126,c297",lowerdir=/var/lib/rkt/cas/tree/deps-sha512-bf60d75289e2254fb14edde2de6742da5d4806c57b88634a55f643b08dae7120/rootfs,upperdir=/var/lib/rkt/pods/run/e40e09e7-ff47-47b7-a821-eb0d1f319524/overlay/deps-sha512-bf60d75289e2254fb14edde2de6742da5d4806c57b88634a55f643b08dae7120/upper/privateinternet,workdir=/var/lib/rkt/pods/run/e40e09e7-ff47-47b7-a821-eb0d1f319524/overlay/deps-sha512-bf60d75289e2254fb14edde2de6742da5d4806c57b88634a55f643b08dae7120/work/privateinternet)
overlay on /var/lib/rkt/pods/exited-garbage/e40e09e7-ff47-47b7-a821-eb0d1f319524/stage1/rootfs/opt/stage2/transmission/rootfs type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c126,c297",lowerdir=/var/lib/rkt/cas/tree/deps-sha512-afb419a3ff4475f3b62926fc90a7b093276d853454d560bc9b254f3d5ec01143/rootfs,upperdir=/var/lib/rkt/pods/run/e40e09e7-ff47-47b7-a821-eb0d1f319524/overlay/deps-sha512-afb419a3ff4475f3b62926fc90a7b093276d853454d560bc9b254f3d5ec01143/upper/transmission,workdir=/var/lib/rkt/pods/run/e40e09e7-ff47-47b7-a821-eb0d1f319524/overlay/deps-sha512-afb419a3ff4475f3b62926fc90a7b093276d853454d560bc9b254f3d5ec01143/work/transmission)
proc on /var/lib/rkt/pods/exited-garbage/e40e09e7-ff47-47b7-a821-eb0d1f319524/netns type proc (rw,nosuid,nodev,noexec,relatime)
[root@svr05 ~]# 
alban commented

According to rmdir(2), rmdir returns EBUSY when "pathname is currently used as a mount point or is the root directory of the calling process".

First, can you check if any processes use the directory that cannot be removed as root directory? Something like:

# ROOTFS_INODE=$(stat --format=%i /var/lib/rkt/pods/run/8a45bd13-4625-44d5-b937-8a04e5b9a52c/stage1/rootfs/)
# for i in /proc/[0-9]*/ ; do echo -n "$i $(cat $i/comm) " ; stat -L $i/root | grep Device ; done | grep $ROOTFS_INODE

If you find remaining processes, check whether they are in different pid and mnt namespaces (/proc/[0-9]*/ns/{mnt,pid})
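That namespace check can be sketched with readlink on the /proc/<pid>/ns/* links (a hypothetical helper, not part of rkt; it lists every process whose mnt namespace differs from the current shell's, which may require root to read all of /proc):

```shell
# List processes whose mnt namespace differs from ours. Any hit here is a
# candidate for holding a reference to a leaked mount namespace.
HOST_MNT=$(readlink /proc/self/ns/mnt)
for p in /proc/[0-9]*; do
  ns=$(readlink "$p/ns/mnt" 2>/dev/null) || continue
  if [ "$ns" != "$HOST_MNT" ]; then
    echo "$p ($(cat "$p/comm" 2>/dev/null)) mnt=$ns"
  fi
done
```

The same loop works for ns/pid or ns/net by substituting the link name.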

If it is not that, maybe pathname is still used as a mount point in some namespace. Before kernel 3.18, rmdir could return EBUSY when the directory is a mount point in another namespace. This could lead to a DoS where the host cannot delete files because they are used as a mount point in a container. This was fixed in torvalds/linux@8ed936b. This might be why this bug is visible on old CentOS kernels but not newer ones.

But the container mnt namespace should be released when the container is terminated. I would not be surprised if old kernels were leaking mnt namespaces in some cases. There were circular references on mnt namespaces in older kernels (torvalds/linux@4ce5d2b).

I don't know how to easily check if it is the case. You could check if any process is in the container mnt namespace (/proc/[0-9]*/ns/mnt). But the mnt namespace could stay alive without any process in it if someone opens a reference on it via /proc/[0-9]*/ns/mnt or if a kernel bug maintains a reference on it.

In any case, rkt could be patched to continue GCing the other pods when one of them fails with EBUSY like this.

I'm also having this problem, also on CentOS 7, same kernel as @jcollie. This host is using rkt 0.13.0.

[root@gocd-server-2b5933f2 ~]# rkt gc --grace-period=0
Garbage collecting pod "9bec4e8c-8b94-4017-b9a3-f7639ba98382"
Error getting stage1 treeStoreID: no such file or directory
[root@gocd-server-2b5933f2 ~]# ROOTFS_INODE=$(stat --format=%i /var/lib/rkt/pods/run/9bec4e8c-8b94-4017-b9a3-f7639ba98382/stage1/rootfs/)
stat: cannot stat ‘/var/lib/rkt/pods/run/9bec4e8c-8b94-4017-b9a3-f7639ba98382/stage1/rootfs/’: No such file or directory

So the rootfs doesn't exist anymore. Has the db gotten out of sync with the filesystem?

Nothing is showing up with those commands either, so I suspect that you're right about a kernel bug. I'd switch the system to CoreOS but it's a baremetal system and I'm not local to the box at the moment, plus I'll need some time to rethink how I do my container networking.

[root@svr05 ~]# rkt gc
Garbage collecting pod "42e78965-c60b-4f4f-b412-484cd381fe90"
Error getting stage1 treeStoreID: no such file or directory
Skipping stage1 GC
Unable to remove pod "42e78965-c60b-4f4f-b412-484cd381fe90": remove /var/lib/rkt/pods/exited-garbage/42e78965-c60b-4f4f-b412-484cd381fe90/stage1/rootfs: device or resource busy
[root@svr05 ~]# stat --format=%i /var/lib/rkt/pods/exited-garbage/42e78965-c60b-4f4f-b412-484cd381fe90/stage1/rootfs
96759102
[root@svr05 ~]# ROOTFS_INODE=$(stat --format=%i /var/lib/rkt/pods/exited-garbage/42e78965-c60b-4f4f-b412-484cd381fe90/stage1/rootfs)
[root@svr05 ~]# for i in /proc/[0-9]*/ ; do echo -n "$i $(cat $i/comm) " ; stat -L $i/root | grep Device ; done | grep $ROOTFS_INODE
[root@svr05 ~]# ls /proc/[0-9]*/ns/mnt
/proc/108/ns/mnt    /proc/19960/ns/mnt  /proc/3015/ns/mnt  /proc/4937/ns/mnt
/proc/10/ns/mnt     /proc/19/ns/mnt     /proc/3030/ns/mnt  /proc/507/ns/mnt
/proc/11/ns/mnt     /proc/1/ns/mnt      /proc/3035/ns/mnt  /proc/514/ns/mnt
/proc/12238/ns/mnt  /proc/20160/ns/mnt  /proc/3036/ns/mnt  /proc/51/ns/mnt
/proc/12239/ns/mnt  /proc/20173/ns/mnt  /proc/3040/ns/mnt  /proc/534/ns/mnt
/proc/12240/ns/mnt  /proc/20185/ns/mnt  /proc/3059/ns/mnt  /proc/54/ns/mnt
/proc/12455/ns/mnt  /proc/20188/ns/mnt  /proc/3060/ns/mnt  /proc/55/ns/mnt
/proc/12473/ns/mnt  /proc/20214/ns/mnt  /proc/3064/ns/mnt  /proc/573/ns/mnt
/proc/12/ns/mnt     /proc/20216/ns/mnt  /proc/30/ns/mnt    /proc/574/ns/mnt
/proc/1343/ns/mnt   /proc/20/ns/mnt     /proc/319/ns/mnt   /proc/575/ns/mnt
/proc/13/ns/mnt     /proc/21/ns/mnt     /proc/31/ns/mnt    /proc/576/ns/mnt
/proc/14697/ns/mnt  /proc/23/ns/mnt     /proc/32/ns/mnt    /proc/577/ns/mnt
/proc/14/ns/mnt     /proc/24/ns/mnt     /proc/381/ns/mnt   /proc/578/ns/mnt
/proc/1533/ns/mnt   /proc/2581/ns/mnt   /proc/382/ns/mnt   /proc/57/ns/mnt
/proc/1535/ns/mnt   /proc/2595/ns/mnt   /proc/38/ns/mnt    /proc/581/ns/mnt
/proc/1536/ns/mnt   /proc/25/ns/mnt     /proc/391/ns/mnt   /proc/586/ns/mnt
/proc/1537/ns/mnt   /proc/26/ns/mnt     /proc/392/ns/mnt   /proc/587/ns/mnt
/proc/1539/ns/mnt   /proc/275/ns/mnt    /proc/39/ns/mnt    /proc/588/ns/mnt
/proc/1576/ns/mnt   /proc/277/ns/mnt    /proc/3/ns/mnt     /proc/589/ns/mnt
/proc/1578/ns/mnt   /proc/27/ns/mnt     /proc/407/ns/mnt   /proc/590/ns/mnt
/proc/15/ns/mnt     /proc/28/ns/mnt     /proc/408/ns/mnt   /proc/606/ns/mnt
/proc/1619/ns/mnt   /proc/290/ns/mnt    /proc/409/ns/mnt   /proc/628/ns/mnt
/proc/1620/ns/mnt   /proc/291/ns/mnt    /proc/40/ns/mnt    /proc/632/ns/mnt
/proc/16282/ns/mnt  /proc/292/ns/mnt    /proc/410/ns/mnt   /proc/634/ns/mnt
/proc/1692/ns/mnt   /proc/293/ns/mnt    /proc/411/ns/mnt   /proc/636/ns/mnt
/proc/16/ns/mnt     /proc/2972/ns/mnt   /proc/412/ns/mnt   /proc/639/ns/mnt
/proc/17/ns/mnt     /proc/297/ns/mnt    /proc/413/ns/mnt   /proc/703/ns/mnt
/proc/1878/ns/mnt   /proc/298/ns/mnt    /proc/41/ns/mnt    /proc/759/ns/mnt
/proc/18/ns/mnt     /proc/2995/ns/mnt   /proc/42/ns/mnt    /proc/761/ns/mnt
/proc/19242/ns/mnt  /proc/299/ns/mnt    /proc/43/ns/mnt    /proc/76/ns/mnt
/proc/1928/ns/mnt   /proc/29/ns/mnt     /proc/4638/ns/mnt  /proc/7/ns/mnt
/proc/19737/ns/mnt  /proc/2/ns/mnt      /proc/4641/ns/mnt  /proc/8/ns/mnt
/proc/19947/ns/mnt  /proc/300/ns/mnt    /proc/483/ns/mnt   /proc/9/ns/mnt
[root@svr05 ~]# 
alban commented

I can reproduce this on CentOS 7 as well.

For some unknown reason, all rkt mounts are still mounted in the systemd-udevd mount namespace, even though they are correctly umounted in the host mount namespace:

# grep /var/lib/rkt/pods/ /proc/$(pidof systemd-udevd)/mountinfo
alban commented

The mount point / in the systemd-udevd mount namespace is a slave mount of the / in the host mount namespace (see "master:1" and "shared:1"):

$ grep '/ / ' /proc/$(pidof systemd-udevd)/mountinfo
43 42 202:1 / / rw,relatime master:1 - xfs /dev/xvda1 rw,seclabel,attr2,inode64,noquota
$ grep '/ / ' /proc/1/mountinfo
60 1 202:1 / / rw,relatime shared:1 - xfs /dev/xvda1 rw,seclabel,attr2,inode64,noquota
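Those "shared:N"/"master:N" tags live in the optional fields of /proc/<pid>/mountinfo, between the per-mount options and the "-" separator (see proc(5)). A small sketch to dump the propagation state of every mount in a namespace (plain awk, no rkt-specific assumptions):

```shell
# Print each mount point (field 5) with its propagation tags; mountinfo's
# optional fields run from field 7 up to the "-" separator, and a mount
# with no optional fields is private.
props=$(awk '{
  opt = ""
  for (i = 7; $i != "-"; i++) opt = opt $i " "
  printf "%-40s %s\n", $5, (opt == "" ? "private" : opt)
}' /proc/self/mountinfo)
echo "$props"
```

Pointing it at /proc/$(pidof systemd-udevd)/mountinfo instead shows which of udevd's mounts are still slaves of host peer groups.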

With the new code for rkt-fly, rkt gc first sets the rkt mount points as private in the host mount namespace: see needsRemountPrivate

This blocks the umount propagation event from the host to the udevd mount namespace, so the rkt mounts are never fully umounted and the rkt mount namespaces are not released.

The same leak exists on my Fedora 23 but since I have a recent kernel with torvalds/linux@8ed936b, I don't have the EBUSY symptom and the mount namespace leak is not visible to the user.

This was introduced by #1856

/cc @steveej

alban commented

@blalor the bug you've got seems to be different: it does not say "remove /var/lib/rkt/.../stage1/rootfs: device or resource busy" but "Error getting stage1 treeStoreID: no such file or directory". Yours should be fixed in rkt 0.14.0 by #1828.

alban commented

Since systemd-v212, systemd-udevd is started with "MountFlags=slave", see systemd-udevd.service.

The CentOS 7 release shipped with systemd-v208 but has updates to systemd-v219. I don't see the error message on GC with systemd-v208, but I do see it with systemd-v219.

In any case, other services can use systemd's "MountFlags" option, so rkt's GC needs to be fixed.

alban commented

@steveej I think GC should not set any mount point as MS_PRIVATE and rkt-fly should set all its mount points as MS_SLAVE+MS_SHARED. I tested it in this shell script and it seems to do what I want: the mount point gets umounted in the udevd namespace too:
https://gist.github.com/alban/75f605b8606b195008d6
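The idea can be sketched with plain mount(8) propagation flags (this is an illustration of the approach, not rkt's actual code or the gist above; it needs root and permission to mount, and skips gracefully otherwise):

```shell
# Instead of --make-private, mark the mount slave *and* shared: it keeps
# receiving events from its parent, and its umount still propagates to
# slave namespaces such as systemd-udevd's.
if [ "$(id -u)" -ne 0 ]; then
  status="skipped: requires root"
else
  d=$(mktemp -d)
  if mount -t tmpfs tmpfs "$d" 2>/dev/null; then
    mount --make-slave "$d" 2>/dev/null || true  # no-op if not in a peer group
    mount --make-shared "$d"
    umount "$d"   # propagates to slave namespaces instead of being blocked
    status="ok"
  else
    status="skipped: mount not permitted here"
  fi
  rmdir "$d"
fi
echo "$status"
```

Run inside and outside a slave mount namespace (as in the gist) to observe the propagation difference.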

I will continue this branch tomorrow: https://github.com/kinvolk/rkt/commits/alban/udevd

I'm afraid I can't reproduce this locally: https://gist.github.com/steveeJ/3f87d5939b973741d227

While the behavior I see is intended for rkt's use case, I'm not sure if it is intended by Linux.
It's questionable why the umounts are propagated to the mount namespace of systemd-udevd. The mounts obviously lose the master/shared attributes after being declared MS_PRIVATE in the host's namespace.

anyone else tired of trying to keep track of what's really in CentOS 7? kernel, systemd, basically everything else. :-(

Hi guys,

Same issue here

sudo /opt/bin/rkt gc --expire-prepared=0s --grace-period=0s
Garbage collecting pod "b49c3a51-161a-4448-a04b-2c8614c983bc"
Error getting stage1 treeStoreID: no such file or directory
Skipping stage1 GC
Unable to remove pod "b49c3a51-161a-4448-a04b-2c8614c983bc": remove /var/lib/rkt/pods/exited-garbage/b49c3a51-161a-4448-a04b-2c8614c983bc/stage1/rootfs/tmp: device or resource busy

My system is under Debian Jessie

Interestingly, something in the latest round of CentOS updates has fixed my GC problem. I'm now on kernel 3.10.0-327.4.5.el7.x86_64. Nothing in the changelog jumped out at me, but perhaps it was a change in some other package, not the kernel.

And I spoke too soon, same problem is back. 😞

I'm hitting this too, on Centos 7.2 with systemd-219-19 and kernel 3.10.0-327.4.5. Is there a workaround until it can be solved?

@alban and I looked a bit into this today.

We wrote a shell script confirming that the CentOS kernel and more recent ones don't behave the same way.

In the script we create a mount namespace, with / configured as recursive+slave. Then we create two nested mountpoints to simulate the stage1 and stage2 rootfs, and then we simulate GC by making these two mountpoints private, unmounting them, and trying to delete them.

On a current kernel the output is:

In the host namespace:
243 39 0:44 / /tmp/udevd-experiment-8739/s1 rw,relatime shared:145 - tmpfs tmpfs rw
249 243 0:45 / /tmp/udevd-experiment-8739/s1/rootfs rw,relatime shared:149 - tmpfs tmpfs rw
In the udevd namespace:
244 237 0:44 / /tmp/udevd-experiment-8739/s1 rw,relatime shared:146 master:145 - tmpfs tmpfs rw
254 244 0:45 / /tmp/udevd-experiment-8739/s1/rootfs rw,relatime shared:152 master:149 - tmpfs tmpfs rw
Simulating GC...
In the host namespace:
In the udevd namespace:
/home/iaguis/udevd-experiment.sh: line 18: 25815 Terminated              sleep 5000

And on CentOS 7:

In the host namespace:
80 52 0:37 / /tmp/udevd-experiment-27508/s1 rw,relatime shared:63 - tmpfs tmpfs rw,seclabel
82 80 0:38 / /tmp/udevd-experiment-27508/s1/rootfs rw,relatime shared:65 - tmpfs tmpfs rw,seclabel
In the udevd namespace:
81 79 0:37 / /tmp/udevd-experiment-27508/s1 rw,relatime shared:64 master:63 - tmpfs tmpfs rw,seclabel
83 81 0:38 / /tmp/udevd-experiment-27508/s1/rootfs rw,relatime shared:66 master:65 - tmpfs tmpfs rw,seclabel
Simulating GC...
rm: cannot remove ‘/tmp/udevd-experiment-27508/s1’: Device or resource busy
In the host namespace:
In the udevd namespace:
81 79 0:37 / /tmp/udevd-experiment-27508/s1 rw,relatime shared:64 - tmpfs tmpfs rw,seclabel
83 81 0:38 / /tmp/udevd-experiment-27508/s1/rootfs rw,relatime shared:66 - tmpfs tmpfs rw,seclabel
/home/vagrant/rkt-v1.0.0/udevd-experiment.sh: line 18:  5821 Terminated              sleep 5000

We can see that on the recent kernel, after simulating GC, the mount is not present in either namespace. However, on CentOS, we get the Device or resource busy error and we see the mount being leaked in the "udevd" namespace.

As a (nasty) workaround, to GC the rkt pods in CentOS you need to make sure you're not running any containers, stop the systemd-udevd and systemd-machined services and run the rkt GC command. Rebooting and running the GC should work as well.

Does anyone happen to know what the minimum working kernel version is? We really want to adopt rkt, but this is blocking for us.

The lowest version that I've tried that works is 4.3.

If that's true, the docs should be updated, at least in https://github.com/coreos/rkt/blob/v1.1.0/Documentation/dependencies.md

alban commented

I updated the shell script from #1922 (comment) to https://gist.github.com/alban/27ee74b75904594f212f.

It shows that the difference of behavior is not in the umount but in the "rm -rf". That's caused by torvalds/linux@8ed936b and the lowest version that should work is v3.18.

There's one thing we could do to alleviate this when we make fly its own command.

We could implement the GC as it is now (that is, remounting the mounts with MS_PRIVATE) only for fly. That way, only pods run by fly would have this problem; the others would GC fine.

It's not a solution, but it can help people running kernels <3.18 that don't use fly.

cc @steveej

alban commented

I'm updating the docs about the run-time dependency on Linux 3.18: #2282

Not sure what we should do here...

@iaguis I'm using the mainline kernel from elrepo and it works fine. Probably this can just be documented in the docs.

I'll do something similar in my chef-rkt cookbook and just provide a warning if a kernel <3.18 is used.
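Such a warning can be sketched portably with sort -V (a hypothetical snippet, not taken from the cookbook; 3.18 is the threshold established above):

```shell
# Warn when the running kernel predates 3.18, where the rmdir fix
# (torvalds/linux@8ed936b) landed.
required=3.18
current=$(uname -r | cut -d- -f1)   # strip the distro suffix, e.g. -327.el7
oldest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)
if [ "$oldest" = "$current" ] && [ "$current" != "$required" ]; then
  echo "WARNING: kernel $current < $required; rkt gc may fail with EBUSY"
else
  echo "kernel $current is recent enough"
fi
```

sort -V compares version strings component by component, so 3.9 correctly sorts before 3.18.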

Seeing this exact issue in docker. Switching systemd-udevd to use the host mountns (comment out MountFlags=slave) takes care of it.
What's strange is I can also enter the mountns for udevd and unmount the offending entry with no issue. Once I do that unmount, I can remove the directory in docker.
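For reference, that manual cleanup can be scripted roughly like this (a hypothetical sketch: it assumes root and util-linux's nsenter, takes the leaked path as an optional argument defaulting to the pod from this issue, and does nothing if systemd-udevd isn't running):

```shell
# Unmount a leaked mountpoint inside systemd-udevd's mount namespace,
# so the host can then remove the directory.
leaked=${1:-/var/lib/rkt/pods/exited-garbage/42e78965-c60b-4f4f-b412-484cd381fe90/stage1/rootfs}
udevd_pid=""
for p in /proc/[0-9]*; do
  if [ "$(cat "$p/comm" 2>/dev/null)" = "systemd-udevd" ]; then
    udevd_pid=${p#/proc/}
    break
  fi
done
if [ -z "$udevd_pid" ]; then
  msg="systemd-udevd is not running; nothing to do"
else
  nsenter --mount --target "$udevd_pid" umount "$leaked" \
    && msg="umounted $leaked in udevd's namespace" \
    || msg="umount failed (already gone, or not root?)"
fi
echo "$msg"
```

This is a stopgap for kernels <3.18; the real fix is in how rkt marks its mounts during GC.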