mviereck/x11docker

Failed to fdwalk: Operation not permitted (close_range syscall / seccomp issue)

mviereck opened this issue · 18 comments

Coming from #345

@smac89 wrote:

I am currently in the middle of reporting a bug with the terminal running inside the container.

[Screenshot: ArcoLinux_2021-04-22_01-57-16]

You wouldn't happen to have an idea of why this might be happening inside the container? 😄

This is odd. It seems zsh wants to do some privileged setup.

x11docker drops all capabilities with --cap-drop=ALL --security-opt=no-new-privileges. Please try with --cap-default to get the default capabilities that docker or podman would give to a container.

Edit: Just tried with xfce 4.16 on debian testing and could not reproduce. Works as expected.

Ok this is now resolved. See https://gitlab.xfce.org/apps/xfce4-terminal/-/issues/116#note_30805.

The problem seems to be coming from vte3>=0.63.91. I'm not fully convinced this is the problem because I'm on archlinux, and we have the latest vte3 (0.64.0) installed, and xfce4-terminal works no problem. In retrospect, we are currently on a stable branch, and not building from master, so that might be a contributing factor?

I will make a PR to fix the problem once I fully understand what is causing it. As for now, I will rebuild the image to use vte3==0.62.3, which does not seem to affect the terminal.

> Edit: Just tried with xfce 4.16 on debian testing and could not reproduce. Works as expected.

I tried an Ubuntu image built from source, and that one worked, but when I checked which version of vte3 they use, I saw they are using 0.60.3, so this is probably the same version you have on debian.

> This is odd. It seems zsh wants to do some privileged setup.

That's the deceptive part about that message that led me on a chase for privileges to add to the container. I added all the privileges I could, but none of them was enough to get past the issue.

I finally arrived at the actual line in vte's source code which produces that error. I don't want to bore you with how I got to it, but the call to open the terminal starts somewhere here (from xfce4-terminal):

https://github.com/xfce-mirror/xfce4-terminal/blob/d082cec41b23b85131f0275229d846b1488c85fe/terminal/terminal-screen.c#L1984

And makes its way into vte3 which leads to:
https://github.com/GNOME/vte/blob/a5817b18ec2e770cfea0e51ae3350943c3287956/src/spawn.cc#L727

and...

https://github.com/GNOME/vte/blob/109a6cf6e05ef55b79f768a30fdf95723ebba0d3/src/spawn.cc#L352-L353
then...
https://github.com/GNOME/vte/blob/a5817b18ec2e770cfea0e51ae3350943c3287956/src/missing.cc#L161
and finally
https://github.com/GNOME/vte/blob/a5817b18ec2e770cfea0e51ae3350943c3287956/src/missing.cc#L176

^ that's where everything blows up.

The close_range syscall is apparently not allowed inside docker containers. You can reliably reproduce it on your end with this code (taken from here):

#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>   /* PATH_MAX */
#include <linux/close_range.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>
#include <dirent.h>

/* Show the contents of the symbolic links in /proc/self/fd */

static void
show_fds(void)
{
   DIR *dirp = opendir("/proc/self/fd");
   if (dirp  == NULL) {
       perror("opendir");
       exit(EXIT_FAILURE);
   }

   for (;;) {
       struct dirent *dp = readdir(dirp);
       if (dp == NULL)
           break;

       if (dp->d_type == DT_LNK) {
           char path[PATH_MAX], target[PATH_MAX];
           snprintf(path, sizeof(path), "/proc/self/fd/%s",
                    dp->d_name);

           ssize_t len = readlink(path, target, sizeof(target));
           if (len == -1)   /* skip entries that cannot be resolved */
               continue;
           printf("%s ==> %.*s\n", path, (int) len, target);
       }
   }

   closedir(dirp);
}

int
main(int argc, char *argv[])
{
   for (int j = 1; j < argc; j++) {
       int fd = open(argv[j], O_RDONLY);
       if (fd == -1) {
           perror(argv[j]);
           exit(EXIT_FAILURE);
       }
       printf("%s opened as FD %d\n", argv[j], fd);
   }

   show_fds();

   printf("========= About to call close_range() =======\n");

   if (syscall(__NR_close_range, 3, ~0U, 0) == -1) {
       perror("close_range");
       exit(EXIT_FAILURE);
   }

   show_fds();
   exit(EXIT_SUCCESS);
}

Run it with a few files as arguments. It opens each file, then uses the close_range syscall to close them all again.

This code runs fine on my host machine, but in a container, it fails with (you guessed it!) "Operation not permitted".

In conclusion, the problem is not with zsh (I actually tried it with a bash login shell, and the same problem occurred) or with x11docker, but rather with a close_range syscall made by vte3, which is used by xfce4-terminal.

Issue reported here: https://github.com/containers/podman/issues/10130

I assume this is related to seccomp profiles, but I'm not sure yet

Thank you for the detailed report!
It might be worth running with --security-opt seccomp=unconfined like:

x11docker --desktop -- --security-opt seccomp=unconfined -- x11docker/xfce

Yep. I can confirm that does in fact fix the problem. Do you know why this works? I was under the impression that docker's default seccomp profile should not have interfered with that syscall, because close_range is whitelisted in that file.

I even downloaded that json file and used it to run the container, and it was still blocking close_range. Unless podman is simply ignoring that flag, I'm not sure.

Do you know the difference between --security-opt seccomp=unconfined and --privileged?

Found the answer here:

Using the --privileged flag when creating a container with docker run disables seccomp in all versions of docker - even if you explicitly specify a seccomp profile. In general you should avoid using the --privileged flag as it does too many things. You can achieve the same goal with --cap-add ALL --security-opt apparmor=unconfined --security-opt seccomp=unconfined. If you need access to devices use --device.

So I guess security-wise, it may not be all that bad using just --security-opt seccomp=unconfined

> I even downloaded that json file and used it to run the container, and it was still blocking close_range.

That is odd.
x11docker also sets --security-opt label=type:container_runtime_t to allow access to shared X unix sockets. I wonder if this somehow interferes and disallows close_range. I doubt it, however, I barely understand SELinux at all.

> So I guess security-wise, it may not be all that bad using just --security-opt seccomp=unconfined

I think so, too. My debian system does not use SELinux at all.

I've confused SELinux and seccomp here. They are different security layers. I am not familiar with either of them.

--security-opt seccomp=unconfined disables the seccomp profile for containers. Likely somewhere in this profile close_range is forbidden or not whitelisted.

Edit:
Some documentation: https://docs.docker.com/engine/security/seccomp/
The default docker seccomp profile whitelists close_range: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json
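
To check from inside a container whether a seccomp filter is active at all, one option is to read the Seccomp field in /proc/self/status. This is only a minimal sketch, assuming a Linux kernel that exposes that field; the helper name seccomp_mode is made up for illustration and is not part of x11docker, docker or podman. It shows whether a filter is in place, not which syscalls it blocks.

```shell
#!/bin/sh
# Print the seccomp mode recorded in a /proc status file.
# Kernel values: 0 = disabled, 1 = strict, 2 = filter (a profile is active).
# seccomp_mode is a hypothetical helper name used only for this sketch.
seccomp_mode() {
    # Compare the first field exactly so "Seccomp_filters:" is ignored.
    awk '$1 == "Seccomp:" { print $2 }' "${1:-/proc/self/status}" 2>/dev/null
}

case "$(seccomp_mode)" in
    0) echo "seccomp disabled (as with seccomp=unconfined)" ;;
    2) echo "seccomp filter active" ;;
    *) echo "seccomp mode strict or unknown" ;;
esac
```

Running it once in a container started normally and once with --security-opt seccomp=unconfined should make the difference between the two setups visible.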

I use podman, and their default profile also whitelists close_range.

> That is odd.
> x11docker also sets --security-opt label=type:container_runtime_t to allow access to shared X unix sockets. I wonder if this somehow interferes and disallows close_range. I doubt it, however, I barely understand SELinux at all.

I think the next course of action is to run the container without x11docker. This should once and for all remove all blame from x11docker. Once I get the chance, I will do that and report my findings here

I doubt it too. In the bug report I created at containers/podman#10337, I was able to reproduce the issue with a simple image built with buildah (walk.c is the same c code I pasted above):

buildah bud --no-cache --platform linux/amd64 -f - /tmp <<'EOF'
FROM alpine:edge
RUN apk update && apk add --upgrade build-base libc-dev linux-headers
COPY walk.c /app/walk.c
RUN gcc -o /app/walk /app/walk.c
ENTRYPOINT ["/app/walk"]
EOF

Running the above container gives the same "Operation not permitted" error.

This seems to be fixed in podman now: containers/podman#10337 (comment)

docker seems to be affected, too, not fixed yet: docker/for-linux#1262

This happened for me while trying to run Ubuntu MATE 21.10 inside x11docker 6.6.2 with the usual command x11docker --desktop ubuntu-mate:impish --sudouser --dbus --clipboard --pulseaudio --xtest --init=systemd, which worked normally for 21.04.
Now it can't start MATE Terminal with bash inside it:

[Screenshot: x11docker]

So I have reported bug 1935995 to launchpad.

Currently I'm running x11docker version 6.9.0 and Docker CE version 20.10.7~3-0~ubuntu-bionic. I can't understand how to change my startup options to include -- --security-opt seccomp=unconfined. Could anyone please help me?

> I can't understand how to change my startup options to include -- --security-opt seccomp=unconfined

Put the image name to the end of the command.
Add -- --security-opt seccomp=unconfined -- before the image name.

x11docker --desktop --sudouser --dbus --clipboard --pulseaudio --xtest --init=systemd -- --security-opt seccomp=unconfined -- ubuntu-mate:impish 

This follows the syntax for custom RUN_OPTIONS:

  x11docker [OPTIONS] -- RUN_OPTIONS -- IMAGE [COMMAND [ARG1 ARG2 ...]]
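
The three groups in that syntax can be made explicit with a small wrapper. This is just a sketch: build_x11docker_cmd is a hypothetical helper, not something x11docker provides, and it only prints the command line instead of running it.

```shell
#!/bin/sh
# Assemble an x11docker command line in the form
#   x11docker [OPTIONS] -- RUN_OPTIONS -- IMAGE [COMMAND ...]
# build_x11docker_cmd is a hypothetical helper; it prints, it does not execute.
build_x11docker_cmd() {
    x11docker_opts=$1   # options for x11docker itself
    run_opts=$2         # options passed through to docker/podman run
    image=$3            # image name, optionally followed by a command
    printf '%s\n' "x11docker $x11docker_opts -- $run_opts -- $image"
}

build_x11docker_cmd "--desktop --init=systemd" \
    "--security-opt seccomp=unconfined" \
    "ubuntu-mate:impish"
```

Everything between the two -- separators goes to the container runtime, everything before the first one stays with x11docker.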

Thanks! Now it works. I will document it somewhere.

Found elsewhere:

updating "crun" from 0.10 to 0.19 fixed the issue.

If the issue still occurs with (default) runc, it might help to switch to crun.
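
To see whether an installed crun is older than the version mentioned above, a version comparison along these lines might help. It is a rough sketch: version_ge is an illustrative helper name, the parsing assumes the first line of crun --version looks like "crun version X.Y[.Z]", and 0.19 is only the version reported as working in this thread, not a documented minimum.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted numeric version A >= B.
# Compares up to three numeric components; the helper name is illustrative.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" |
        sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
    [ "$lowest" = "$2" ]
}

# 0.19 is the crun version reported to work in this thread (an assumption,
# not an official threshold).
if command -v crun >/dev/null 2>&1; then
    # Assumes the first output line has the form "crun version X.Y[.Z]".
    installed=$(crun --version | awk 'NR == 1 { print $3 }')
    if version_ge "$installed" 0.19; then
        echo "crun $installed: at or above the version reported to work"
    else
        echo "crun $installed: older than 0.19, updating may help"
    fi
fi
```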

I assume that this issue has meanwhile been fixed in most distributions.

huw commented

@mviereck I just ran into this on Debian 12, maybe they’ve updated the vte3 dependency since Debian 11?