opencontainers/runc

[CVE-2019-19921]: Volume mount race condition with shared mounts

leoluk opened this issue · 12 comments

Disclosed in #2190.

Here's the original report to security@opencontainers.org:

Hi all,

an attacker who controls the container image for two containers that share a volume can race volume mounts during container initialization, by adding a symlink to the rootfs that points to a directory on the volume. The second container won't be able to see the actual mount, but it can race it by modifying the mount point on the volume.

This can be exploited for a full container breakout by racing readonly/mask mounts, allowing writes to dangerous paths like /proc/sys/kernel/core_pattern.

Example:

  • The rootfs of container A has a symlink /proc -> /evil/level1
  • Container A specifies a named volume mounted to /evil
  • Container B, started before container A, shares this named volume and repeatedly swaps /evil/level1 and /evil/level1~
  • Container A mounts procfs to /evil/level1~/level2, but when it remounts /proc/sys, it does so at /evil/level1/level2/sys.

This can reliably be reproduced using runc and podman on Fedora 30 (takes about 0-5s to win the race for me): https://gist.github.com/leoluk/82965ad9df58247202aa0e1878439092

SELinux would ordinarily prevent the exploit by disallowing container_t from writing usermodehelper_t, but it can be disabled by symlinking /proc/self/task/1/attr/exec to something benign like /proc/self/sched (bypassing the procfs check). AppArmor can be disabled similarly.

Docker specifies the mounts in a different order and mounts procfs after it mounts the volumes, mounting over the /proc symlink, which appears to prevent at least the /proc approach. I haven't tested other runc usage scenarios, for instance, k8s+cri-o might be vulnerable as well.

Fabian of Cure53 (in CC) created a minimal PoC that uses runc directly: https://gist.github.com/LiveOverflow/c937820b688922eb127fb760ce06dab9

There are other container init steps after the volume mount that can be raced, obvious ones being utils.CloseExecFrom and the AppArmor/SELinux attrs but there might be others, especially in mountToRootfs (like tricking remount into mounting the rootfs as rshared if there's another volume that specifies the flag, but I haven't tried that).

This is similar to the vulnerability I reported that Adam Iwaniuk disclosed during their Dragon Sector CTF (#2128) and a similar crun one (containers/crun#111).

The fix for the mounts is probably what Aleksa outlined here, using /proc/self/fd to resolve the path: containers/crun#111 (comment)

My proposed ("stop the bleeding") patch was something like the following:

commit 81a9af6677b1f87e70b87e9a655cb4f4d06a0503 (HEAD -> fix-double-volume-attack)
Author: Aleksa Sarai <asarai@suse.de>
Date:   Sat Dec 21 23:40:17 2019 +1100

    rootfs: do not permit /proc mounts to non-directories
    
    mount(2) will blindly follow symlinks, which is a problem because it
    allows a malicious container to trick runc into mounting /proc to an
    entirely different location (and thus within the attacker's control for
    a rename-exchange attack).
    
    This is just a hotfix, and the more complete fix would be finish
    libpathrs and port runc to it (to avoid these types of attacks entirely,
    and defend against a variety of other /proc-related attacks).
    
    Fixes: CVE-YYYY-XXXX
    Signed-off-by: Aleksa Sarai <asarai@suse.de>

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index 291021440a1a..6e896bc4fdaa 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -297,17 +297,49 @@ func mountToRootfs(m *configs.Mount, rootfs, mountLabel string, enableCgroupns b
                dest = filepath.Join(rootfs, dest)
        }
 
+       // For "special" filesystems, we have to be quite careful about mounting --
+       // we must make sure that the destination is what we expect. This is done
+       // by opening the destination as an O_PATH descriptor, and using the
+       // /proc/self/fd/... as the mount target. Unfortunately this is actually
+       // possible to bypass with a little bit of thought, but the complete
+       // solution for this will be to port runc to libpathrs.
        switch m.Device {
-       case "proc", "sysfs":
+       case "proc", "sysfs", "mqueue":
+               // NOTE: If the container controls any part of dest, this is unsafe.
                if err := os.MkdirAll(dest, 0755); err != nil {
                        return err
                }
+               destFd, err := unix.Open(dest, unix.O_PATH|unix.O_CLOEXEC, 0)
+               if err != nil {
+                       return err
+               }
+               defer unix.Close(destFd)
+
+               // Check that the path is exactly what we expect.
+               // NOTE: If the path contains an attacker-controlled bind-mount, this
+               //       check won't do anything. In addition, if procfs is fraudulent,
+               //       it will also be useless. As above, the solution is to switch
+               //       to libpathrs.
+               destFdPath := fmt.Sprintf("/proc/self/fd/%d", destFd)
+               destUnsafePath, err := os.Readlink(destFdPath)
+               if err != nil {
+                       return err
+               }
+               if destUnsafePath != dest {
+                       return fmt.Errorf("detected possible breakout: trying to mount '%s' on '%s' was actually targeted to '%s'", m.Device, dest, destUnsafePath)
+               }
+
+               // Okay, now we can use destFdPath.
+               dest = destFdPath
+               m.Destination = destFdPath
+       }
+
+       // Now actually do the mount.
+       switch m.Device {
+       case "proc", "sysfs":
                // Selinux kernels do not support labeling of /proc or /sys
                return mountPropagate(m, rootfs, "")
        case "mqueue":
-               if err := os.MkdirAll(dest, 0755); err != nil {
-                       return err
-               }
                if err := mountPropagate(m, rootfs, mountLabel); err != nil {
                        // older kernels do not support labeling of /dev/mqueue
                        if err := mountPropagate(m, rootfs, ""); err != nil {

Unfortunately this is not sufficient if / is shared with another container, because then you can do the same trick (but this time on / directly). It also needs some more work to work around the fact that there are m.Destination-based checks elsewhere in rootfs_linux.go.

Your patch does stop the bleeding, though - most runc use cases do not share the rootfs. Mounting a volume on / breaks all kinds of things. Haven't managed to do anything useful using either cri-o or podman.

Alright, I'll prepare a PR. Thanks @leoluk -- and sorry for the response time issues (as well as how the disclosure happened).

any ETA on the workaround to unblock rc10?

I've been off the face of the earth for the past 2ish weeks. I will prepare a PR tomorrow.

#2207 contains a very simplified version of the above patch (the patch I posted above doesn't work because rootfs_linux.go has a very fun relationship with pathnames that I don't have time to debug right now).

Beuc commented

Hi,

I'm part of the Debian Long Term Support (LTS) team, and I'm attempting to fix CVE-2019-19921 in our past releases that package "runc".
(apologies for digging up this old issue :))

I'm still able to reproduce the vulnerability (using the runc reproducer linked in the original topic), in the following situations:

  • backporting the fix 2fc03cc to 1.0.0~rc6 (Debian 10 "buster"/"old-stable")
  • more annoyingly, with 1.0.0~rc93, as shipped in Debian 11 "bullseye"/current; for reference the fix was pushed to rc10

AFAICS the fix does make the exploit less likely, but does not stop it entirely: within a few minutes I'm still able to overwrite my root system's /proc/sys/kernel/core_pattern from container-2.

Is this expected (as in, it's a mitigation but not a bullet-proof fix)?
Or is there a follow-up fix that I missed?

Thanks for your attention and best regards.

leoluk commented

2fc03cc should completely prevent the exploit. It adds a check that refuses to mount procfs on /proc in the rootfs unless the target is either a directory or absent, which makes it impossible to point it to an attacker-controlled bind mount. It's not possible to race /proc itself in this setup (the rootfs is not attacker-accessible during early setup).

Either there's a regression or something's wrong with the Debian backport.

Beuc commented

Thanks for your fast feedback!

Debian might have different dependency versions, because it mostly removes vendor/* and uses the packaged versions.
So I tried an Ubuntu Focal (20.04) VM, where 'runc' is built with the bundled vendor/*, to check whether that was the reason.

Interestingly:

  • 1.0.0~rc10-0ubuntu1 correctly blocks the mount attempt early and 'runc run container-[12]' fails ("must be mounted on ordinary directory")
  • 1.0.0~rc95-0ubuntu1~20.04.2 is vulnerable to the PoC
  • 1.1.0-0ubuntu1~20.04.2 is vulnerable to the PoC

So AFAICS, despite the presence of the fix in all versions, some other commit re-introduced the issue.
(and similarly the fix alone didn't appear to fix ~rc6 in my previous message)

If you've got further insights I'd be grateful :)
Otherwise I can try and bisect to pinpoint when the fix lost its effectiveness (probably tomorrow).

Beuc commented

After a bit of digging, ironically it looks like the fix for this vulnerability (CVE-2019-19921) was broken by the one for CVE-2021-30465: 0ca91f4
This sounds like a regression as you suspected.

Do you want me to open a new ticket for this?
And register a new CVE (if you confirm)?

Hi, I cannot reproduce the vulnerability. I'm using Debian 10, kernel version: Linux runc 4.19.0-23-amd64 #1 SMP Debian 4.19.269-1 (2022-12-20) x86_64 x86_64 x86_64 GNU/Linux. runc version: 1.0.0~rc93+ds1-5+deb11u2. When I run pwn in container-1, it fails with "SYS_renameat2: Permission denied" and cannot change "/poc/layer".