Failed to fdwalk: Operation not permitted (close_range syscall / seccomp issue)
mviereck opened this issue · 18 comments
This is odd. It seems zsh wants to do some privileged setup.
x11docker drops all capabilities with --cap-drop=ALL --security-opt=no-new-privileges. Please try with --cap-default to get the default capabilities that docker or podman would give to a container.
Edit: Just tried with xfce 4.16 on debian testing and could not reproduce. Works as expected.
Ok this is now resolved. See https://gitlab.xfce.org/apps/xfce4-terminal/-/issues/116#note_30805.
The problem seems to be coming from vte3>=0.63.91. I'm not fully convinced this is the problem, because I'm on archlinux with the latest vte3 (0.64.0) installed, and xfce4-terminal works without problems. In retrospect, we are currently on a stable branch and not building from master, so that might be a contributing factor?
I will make a PR to fix the problem once I fully understand what is causing it. As for now, I will rebuild the image to use vte3==0.62.3, which does not seem to affect the terminal.
Edit: Just tried with xfce 4.16 on debian testing and could not reproduce. Works as expected.
I tried an Ubuntu image built from source, and that one worked, but when I checked which version of vte3 they use, I saw they are using 0.60.3, so this is probably the same version you have on debian.
This is odd. It seems zsh wants to do some privileged setup.
That's the deceptive part about that message that led me on a chase for privileges to add to the container. I added all the privileges I could, but none of them was enough to get past the issue.
I finally arrived at the actual line in vte's source code that produces the error. I don't want to bore you with how I got to it, but the call to open the terminal starts somewhere here (from xfce4-terminal):
And makes its way into vte3 which leads to:
https://github.com/GNOME/vte/blob/a5817b18ec2e770cfea0e51ae3350943c3287956/src/spawn.cc#L727
and...
https://github.com/GNOME/vte/blob/109a6cf6e05ef55b79f768a30fdf95723ebba0d3/src/spawn.cc#L352-L353
then...
https://github.com/GNOME/vte/blob/a5817b18ec2e770cfea0e51ae3350943c3287956/src/missing.cc#L161
and finally
https://github.com/GNOME/vte/blob/a5817b18ec2e770cfea0e51ae3350943c3287956/src/missing.cc#L176
^ that's where everything blows up.
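For illustration, here is a minimal sketch of a more tolerant way to close a range of descriptors. It is not vte's fdwalk; close_fd_range is a hypothetical helper, and treating EPERM like ENOSYS rests on the assumption that EPERM from close_range means a seccomp filter rejected the call rather than a genuine error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_close_range
#define __NR_close_range 436 /* assumption: number used on most architectures */
#endif

/* Hypothetical helper (not vte's fdwalk): close every descriptor in
 * [first, last]. Returns 0 on success, -1 with errno set otherwise. */
static int close_fd_range(unsigned int first, unsigned int last)
{
    if (syscall(__NR_close_range, first, last, 0U) == 0)
        return 0;

    /* ENOSYS: the kernel predates close_range (added in Linux 5.9).
     * EPERM: assumed here to mean a seccomp filter rejected the call.
     * Any other error (e.g. EINVAL for first > last) is reported as-is. */
    if (errno != ENOSYS && errno != EPERM)
        return -1;

    /* Fallback: close descriptors one at a time, capped at the fd limit.
     * This can be slow when the limit is very large. */
    long open_max = sysconf(_SC_OPEN_MAX);
    if (open_max <= 0)
        open_max = 1024; /* arbitrary guess if the limit is indeterminate */
    if (last >= (unsigned long)open_max)
        last = (unsigned int)(open_max - 1);
    for (unsigned int fd = first; fd <= last; fd++)
        (void)close((int)fd); /* EBADF for unused slots is expected */
    return 0;
}

#ifdef SKETCH_MAIN /* build with -DSKETCH_MAIN for a standalone demo */
int main(void)
{
    int a = open("/dev/null", O_RDONLY);
    int b = open("/dev/null", O_RDONLY);
    assert(a >= 0 && b > a);
    if (close_fd_range((unsigned int)a, (unsigned int)b) == -1) {
        perror("close_fd_range");
        return 1;
    }
    /* Both descriptors should now be invalid. */
    printf("fd %d closed: %s\n", a, fcntl(a, F_GETFD) == -1 ? "yes" : "no");
    printf("fd %d closed: %s\n", b, fcntl(b, F_GETFD) == -1 ? "yes" : "no");
    return 0;
}
#endif
```

With a fallback like this, a blocked close_range costs some speed instead of aborting the spawn. Whether silently falling back on EPERM is acceptable is a design choice; a stricter caller may prefer to fail.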
The close_range syscall is apparently not allowed inside docker containers. You can reliably reproduce it on your end with this code (taken from here):
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>             /* PATH_MAX */
#include <linux/close_range.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>
#include <dirent.h>

/* Show the contents of the symbolic links in /proc/self/fd */
static void
show_fds(void)
{
    DIR *dirp = opendir("/proc/self/fd");
    if (dirp == NULL) {
        perror("opendir");
        exit(EXIT_FAILURE);
    }

    for (;;) {
        struct dirent *dp = readdir(dirp);
        if (dp == NULL)
            break;

        if (dp->d_type == DT_LNK) {
            char path[PATH_MAX], target[PATH_MAX];
            snprintf(path, sizeof(path), "/proc/self/fd/%s",
                     dp->d_name);

            ssize_t len = readlink(path, target, sizeof(target));
            if (len == -1) {    /* do not print with a negative length */
                perror("readlink");
                continue;
            }
            printf("%s ==> %.*s\n", path, (int) len, target);
        }
    }

    closedir(dirp);
}

int
main(int argc, char *argv[])
{
    /* Open every file named on the command line */
    for (int j = 1; j < argc; j++) {
        int fd = open(argv[j], O_RDONLY);
        if (fd == -1) {
            perror(argv[j]);
            exit(EXIT_FAILURE);
        }
        printf("%s opened as FD %d\n", argv[j], fd);
    }

    show_fds();

    printf("========= About to call close_range() =======\n");

    /* Close all descriptors from 3 upwards in one call */
    if (syscall(__NR_close_range, 3, ~0U, 0) == -1) {
        perror("close_range");
        exit(EXIT_FAILURE);
    }

    show_fds();
    exit(EXIT_SUCCESS);
}
Run it by supplying a bunch of files as arguments. All it does is open the files, then use the close_range syscall to close them.
This code runs fine on my host machine, but in a container, it fails with (you guessed it!) "Operation not permitted".
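A smaller probe can help tell the possible outcomes apart when comparing host and container. This is only a sketch: probe_close_range and the enum names are hypothetical, and reading EPERM as "blocked by a filter" is an assumption, since the errno alone does not prove where the rejection came from.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_close_range
#define __NR_close_range 436 /* assumption: number used on most architectures */
#endif

enum probe_result { PROBE_OK, PROBE_NO_KERNEL_SUPPORT, PROBE_BLOCKED, PROBE_OTHER };

/* Hypothetical helper: call close_range() on one throwaway descriptor and
 * classify the outcome by errno. */
static enum probe_result probe_close_range(void)
{
    int fd = open("/dev/null", O_RDONLY);
    if (fd == -1)
        return PROBE_OTHER;

    if (syscall(__NR_close_range, (unsigned int)fd, (unsigned int)fd, 0U) == 0)
        return PROBE_OK; /* the syscall closed fd for us */

    int saved = errno;
    close(fd); /* the syscall failed, so clean up manually */

    if (saved == ENOSYS)
        return PROBE_NO_KERNEL_SUPPORT; /* kernel older than 5.9 */
    if (saved == EPERM)
        return PROBE_BLOCKED; /* assumption: rejected by a seccomp filter */
    return PROBE_OTHER;
}

static const char *probe_describe(enum probe_result r)
{
    switch (r) {
    case PROBE_OK:                return "close_range works";
    case PROBE_NO_KERNEL_SUPPORT: return "kernel does not implement close_range";
    case PROBE_BLOCKED:           return "close_range rejected with EPERM";
    default:                      return "unexpected failure";
    }
}

#ifdef SKETCH_MAIN /* build with -DSKETCH_MAIN for a standalone demo */
int main(void)
{
    puts(probe_describe(probe_close_range()));
    return 0;
}
#endif
```

Running the same binary on the host and inside the container should make it clear whether the kernel or the container's sandboxing is responsible.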
In conclusion, the problem is not with zsh (I actually tried it with a bash login shell, and the same problem occurred), or x11docker, but rather with a close_range syscall done by vte3, which is used by xfce4-terminal.
Issue reported here: https://github.com/containers/podman/issues/10130
I assume this is related to seccomp profiles, but I'm not sure yet
Thank you for the detailed report!
It might be worth running with --security-opt seccomp=unconfined, like:
x11docker --desktop -- --security-opt seccomp=unconfined -- x11docker/xfce
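To check from inside the container whether such an option actually took effect, one can read the Seccomp field of /proc/self/status, which the kernel provides when built with seccomp support. A minimal sketch; seccomp_mode is a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: return the "Seccomp:" value from /proc/self/status
 * (0 = disabled, 1 = strict, 2 = filter), or -1 if it cannot be read. */
static int seccomp_mode(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    if (fp == NULL)
        return -1;

    char line[256];
    int mode = -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        /* Matches "Seccomp:\t2" but not "Seccomp_filters:\t1". */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }
    fclose(fp);
    return mode;
}

#ifdef SKETCH_MAIN /* build with -DSKETCH_MAIN for a standalone demo */
int main(void)
{
    int mode = seccomp_mode();
    assert(mode <= 2);
    printf("seccomp mode: %d\n", mode);
    return 0;
}
#endif
```

A value of 2 means a filter is active for the process; 0 would be expected with seccomp=unconfined, assuming nothing else in the process installed a filter.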
Yep. I can confirm that does in fact fix the problem. Do you know why this works? I was under the impression that docker's default seccomp profile should not have interfered with that syscall, because close_range is whitelisted in that file.
I even downloaded that json file and used it to run the container, and it was still blocking close_range. Unless podman is simply ignoring that flag, I'm not sure.
Do you know the difference between --security-opt seccomp=unconfined and --privileged?
Do you know the difference between --security-opt seccomp=unconfined and --privileged?
Found the answer here:
Using the --privileged flag when creating a container with docker run disables seccomp in all versions of docker - even if you explicitly specify a seccomp profile. In general you should avoid using the --privileged flag as it does too many things. You can achieve the same goal with --cap-add ALL --security-opt apparmor=unconfined --security-opt seccomp=unconfined. If you need access to devices use --device.
So I guess security-wise, it may not be all that bad using just --security-opt seccomp=unconfined
I even downloaded that json file and used it to run the container, and it was still blocking close_range.
That is odd.
x11docker also sets --security-opt label=type:container_runtime_t to allow access to shared X unix sockets. I wonder if this somehow interferes and disallows close_range. I doubt it, however, I barely understand SELinux at all.
So I guess security-wise, it may not be all that bad using just --security-opt seccomp=unconfined
I think so, too. My debian system does not use SELinux at all.
I've confused SELinux and seccomp here. They are different security layers. I am not familiar with either of them.
--security-opt seccomp=unconfined disables the seccomp profile for containers. Likely somewhere in this profile close_range is forbidden or not whitelisted.
Edit:
Some documentation: https://docs.docker.com/engine/security/seccomp/
The default docker seccomp profile whitelists close_range: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json
I use podman, and their default profile also whitelists close_range.
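To show what a seccomp filter does at the syscall level, independent of docker or podman, here is a minimal sketch that installs a raw seccomp-BPF filter denying close_range with EPERM. It is not the runtimes' actual profile and does not explain why a whitelisted syscall was still blocked; deny_close_range is a hypothetical helper, and a production filter would also check seccomp_data.arch, which this sketch omits.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_close_range
#define __NR_close_range 436 /* assumption: number used on most architectures */
#endif

/* Hypothetical helper: install a seccomp filter that makes close_range()
 * fail with EPERM and allows every other syscall. Returns 0 on success. */
static int deny_close_range(void)
{
    struct sock_filter filter[] = {
        /* Load the syscall number from the seccomp_data record. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* close_range: fall through to the ERRNO return; otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_close_range, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Needed so that an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

#ifdef SKETCH_MAIN /* build with -DSKETCH_MAIN for a standalone demo */
int main(void)
{
    if (deny_close_range() == -1) {
        perror("deny_close_range");
        return 1;
    }
    if (syscall(__NR_close_range, 3U, ~0U, 0U) == -1)
        printf("close_range: %s\n", strerror(errno));
    else
        printf("close_range succeeded\n");
    return 0;
}
#endif
```

Once the filter is installed, the call fails with EPERM no matter which capabilities the process holds, which fits the observation that adding privileges to the container did not help.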
That is odd.
x11docker also sets --security-opt label=type:container_runtime_t to allow access to shared X unix sockets. I wonder if this somehow interferes and disallows close_range. I doubt it, however, I barely understand SELinux at all.
I think the next course of action is to run the container without x11docker. This should once and for all remove all blame from x11docker. Once I get the chance, I will do that and report my findings here.
I doubt it too. In the bug report I created at containers/podman#10337, I was able to reproduce the issue with a simple image built with buildah (walk.c is the same C code I pasted above):
buildah bud --no-cache --platform linux/amd64 -f - /tmp <<'EOF'
FROM alpine:edge
RUN apk update && apk add --upgrade build-base libc-dev linux-headers
COPY walk.c /app/walk.c
RUN gcc -o /app/walk /app/walk.c
ENTRYPOINT ["/app/walk"]
EOF
Running the above container gives the same "Operation not permitted" error.
This seems to be fixed in podman now: containers/podman#10337 (comment)
docker seems to be affected, too, not fixed yet: docker/for-linux#1262
This happened for me while trying to run Ubuntu MATE 21.10 inside x11docker 6.6.2 with the usual command x11docker --desktop ubuntu-mate:impish --sudouser --dbus --clipboard --pulseaudio --xtest --init=systemd, which worked normally for 21.04.
Now it can't start MATE Terminal with bash inside it:
So I have reported bug 1935995 to launchpad.
Currently I'm running x11docker version 6.9.0 and Docker CE version 20.10.73-0ubuntu-bionic. I can't understand how to change my startup options to include -- --security-opt seccomp=unconfined. Could anyone please help me?
I can't understand how to change my startup options to include -- --security-opt seccomp=unconfined
Put the image name at the end of the command.
Add -- --security-opt seccomp=unconfined -- before the image name.
x11docker --desktop --sudouser --dbus --clipboard --pulseaudio --xtest --init=systemd -- --security-opt seccomp=unconfined -- ubuntu-mate:impish
This follows the syntax for custom RUN_OPTIONS:
x11docker [OPTIONS] -- RUN_OPTIONS -- IMAGE [COMMAND [ARG1 ARG2 ...]]
Thanks! Now it works. I will document it somewhere.
Found elsewhere:
updating "crun" from 0.10 to 0.19 fixed the issue.
If the issue still occurs with (default) runc, it might help to switch to crun.
I assume that this issue has meanwhile been fixed in most distributions.