Failed to fdwalk: Operation not permitted (close_range syscall / seccomp issue)

Question

mviereck opened this issue 4 years ago · 18 comments

mviereck commented 4 years ago

Coming from #345

@smac89 wrote:

I am currently in the middle of reporting a bug with the terminal running inside the container.

You wouldn't happen to have an idea of why this might be happening inside the container? 😄

Answer 1 · 2021-04-22T09:57:39.000Z

This is odd. It seems zsh wants to do some privileged setup.

x11docker drops all capabilities with --cap-drop=ALL --security-opt=no-new-privileges. Please try with --cap-default to get the default capabilities that docker or podman would give to a container.

Edit: Just tried with xfce 4.16 on debian testing and could not reproduce. Works as expected.

Answer 2 · 2021-04-23T22:18:26.000Z

Ok this is now resolved. See https://gitlab.xfce.org/apps/xfce4-terminal/-/issues/116#note_30805.

The problem seems to be coming from vte3>=0.63.91. I'm not fully convinced this is the problem because I'm on archlinux, and we have the latest vte3 (0.64.0) installed, and xfce4-terminal works no problem. In retrospect, we are currently on a stable branch, and not building from master, so that might be a contributing factor?

I will make a PR to fix the problem once I fully understand what is causing it. As for now, I will rebuild the image to use vte3==0.62.3, which does not seem to affect the terminal.

Edit: Just tried with xfce 4.16 on debian testing and could not reproduce. Works as expected.

I tried an Ubuntu image built from source, and that one worked, but when I checked which version of vte3 they use, I saw they are using 0.60.3, so this is probably the same version you have on debian.

Answer 3 · 2021-04-23T22:39:08.000Z

This is odd. It seems zsh wants to do some privileged setup.

That's the deceptive part about that message that led me on a chase for privileges to add to the container. I added all the privileges I could, but none of them was enough to get past the issue.

I finally arrived at the actual line in vte's source code that produces the error. I don't want to bore you with how I got to it, but the call to open the terminal starts somewhere here (from xfce4-terminal):

https://github.com/xfce-mirror/xfce4-terminal/blob/d082cec41b23b85131f0275229d846b1488c85fe/terminal/terminal-screen.c#L1984

And makes its way into vte3 which leads to:
https://github.com/GNOME/vte/blob/a5817b18ec2e770cfea0e51ae3350943c3287956/src/spawn.cc#L727

and...

https://github.com/GNOME/vte/blob/109a6cf6e05ef55b79f768a30fdf95723ebba0d3/src/spawn.cc#L352-L353
then...
https://github.com/GNOME/vte/blob/a5817b18ec2e770cfea0e51ae3350943c3287956/src/missing.cc#L161
and finally
https://github.com/GNOME/vte/blob/a5817b18ec2e770cfea0e51ae3350943c3287956/src/missing.cc#L176

^ that's where everything blows up.

The close_range syscall is apparently not allowed inside docker containers. You can reliably reproduce it on your end with this code (taken from here):

#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>   /* PATH_MAX */
#include <linux/close_range.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <string.h>
#include <unistd.h>
#include <dirent.h>

/* Show the contents of the symbolic links in /proc/self/fd */

static void
show_fds(void)
{
   DIR *dirp = opendir("/proc/self/fd");
   if (dirp  == NULL) {
       perror("opendir");
       exit(EXIT_FAILURE);
   }

   for (;;) {
       struct dirent *dp = readdir(dirp);
       if (dp == NULL)
           break;

       if (dp->d_type == DT_LNK) {
           char path[PATH_MAX], target[PATH_MAX];
           snprintf(path, sizeof(path), "/proc/self/fd/%s",
                    dp->d_name);

           ssize_t len = readlink(path, target, sizeof(target));
           if (len != -1)   /* skip entries whose link cannot be read */
               printf("%s ==> %.*s\n", path, (int) len, target);
       }
   }

   closedir(dirp);
}

int
main(int argc, char *argv[])
{
   for (int j = 1; j < argc; j++) {
       int fd = open(argv[j], O_RDONLY);
       if (fd == -1) {
           perror(argv[j]);
           exit(EXIT_FAILURE);
       }
       printf("%s opened as FD %d\n", argv[j], fd);
   }

   show_fds();

   printf("========= About to call close_range() =======\n");

   if (syscall(__NR_close_range, 3, ~0U, 0) == -1) {
       perror("close_range");
       exit(EXIT_FAILURE);
   }

   show_fds();
   exit(EXIT_SUCCESS);
}

Run it by supplying a bunch of files as arguments. All it does is open the files, then use the close_range syscall to close them.

This code runs fine on my host machine, but in a container, it fails with (you guessed it!) "Permission denied".

Answer 4 · 2021-04-23T22:48:02.000Z

In conclusion, the problem is not with zsh (I actually tried it with a bash login shell, and the same problem occurred) or with x11docker, but rather with the close_range syscall made by vte3, which xfce4-terminal uses.
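
For illustration, a program that wants to survive a blocked close_range could fall back to walking /proc/self/fd. This is only a minimal sketch and not vte's actual code: close_fds_from is a hypothetical name, and treating EPERM the same as ENOSYS is an assumption about how a seccomp-filtered container reports the call.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: close every file descriptor >= first.
 * Tries close_range(2) first. If the kernel lacks it (ENOSYS) or a
 * seccomp filter rejects it (EPERM, an assumption), it walks
 * /proc/self/fd instead. Returns 0 on success, -1 on failure. */
int
close_fds_from(int first)
{
#ifdef __NR_close_range
    if (syscall(__NR_close_range, (unsigned int) first, ~0U, 0U) == 0)
        return 0;
    if (errno != ENOSYS && errno != EPERM)
        return -1;
#endif

    DIR *dirp = opendir("/proc/self/fd");
    if (dirp == NULL)
        return -1;

    int self = dirfd(dirp);  /* the directory's own FD must stay open */
    struct dirent *dp;
    while ((dp = readdir(dirp)) != NULL) {
        char *end;
        long fd = strtol(dp->d_name, &end, 10);
        if (end == dp->d_name || *end != '\0')
            continue;        /* not a number: skips "." and ".." */
        if (fd >= first && fd != self)
            close((int) fd);
    }

    closedir(dirp);
    return 0;
}
```

Calling close_fds_from(3) would then behave like the close_range call in the reproducer above, but without exiting when the syscall is refused.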

Answer 5 · 2021-04-24T08:07:32.000Z

Issue reported here: https://github.com/containers/podman/issues/10130

I assume this is related to seccomp profiles, but I'm not sure yet.

Answer 6 · 2021-04-24T14:24:14.000Z

I assume this is related to seccomp profiles, but I'm not sure yet

Thank you for the detailed report!
It might be worth running with --security-opt seccomp=unconfined, like this:

x11docker --desktop -- --security-opt seccomp=unconfined -- x11docker/xfce

Answer 7 · 2021-04-24T16:36:40.000Z

Yep. I can confirm that does in fact fix the problem. Do you know why this works? I was under the impression that docker's default seccomp profile should not have interfered with that syscall, because close_range is whitelisted in that file.

I even downloaded that json file and used it to run the container, and it was still blocking close_range. Unless podman is simply ignoring that flag, I'm not sure.
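
One way to check whether the call is really being blocked, rather than missing from the kernel, is to invoke close_range with first greater than last. According to the man page that combination returns EINVAL and closes nothing, so the errno value alone tells the cases apart. A small sketch (probe_close_range is a hypothetical name, and reading EPERM as "rejected by a seccomp filter" is an assumption):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical probe: issue a close_range(2) call that cannot close
 * anything (first > last) and return the resulting errno.
 *   EINVAL -> the syscall reached the kernel and is supported
 *   ENOSYS -> the kernel (or these headers) does not know the syscall
 *   EPERM  -> something, presumably a seccomp filter, rejected it */
int
probe_close_range(void)
{
#ifdef __NR_close_range
    errno = 0;
    if (syscall(__NR_close_range, 1U, 0U, 0U) == 0)
        return 0;            /* not expected for first > last */
    return errno;
#else
    return ENOSYS;
#endif
}
```

Running this once on the host and once inside the container, with and without a custom seccomp profile, should show which layer is refusing the call.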

Do you know the difference between --security-opt seccomp=unconfined and --privileged?

Answer 8 · 2021-04-24T16:52:15.000Z

Do you know the difference between --security-opt seccomp=unconfined and --privileged?

Found the answer here:

Using the --privileged flag when creating a container with docker run disables seccomp in all versions of docker - even if you explicitly specify a seccomp profile. In general you should avoid using the --privileged flag as it does too many things. You can achieve the same goal with --cap-add ALL --security-opt apparmor=unconfined --security-opt seccomp=unconfined. If you need access to devices use --device.

So I guess security-wise, it may not be all that bad using just --security-opt seccomp=unconfined

Answer 9 · 2021-04-24T18:55:27.000Z

I even downloaded that json file and used it to run the container, and it was still blocking close_range.

That is odd.
x11docker also sets --security-opt label=type:container_runtime_t to allow access to shared X unix sockets. I wonder if this somehow interferes and disallows close_range. I doubt it, however, I barely understand SELinux at all.

So I guess security-wise, it may not be all that bad using just --security-opt seccomp=unconfined

I think so, too. My debian system does not use SELinux at all.

Answer 10 · 2021-04-25T17:31:18.000Z

I've confused SELinux and seccomp here. They are different security layers. I am not familiar with either of them.

--security-opt seccomp=unconfined disables the seccomp profile for containers. Likely somewhere in this profile close_range is forbidden or not whitelisted.

Edit:
Some documentation: https://docs.docker.com/engine/security/seccomp/
The default docker seccomp profile whitelists close_range: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json
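
To see from inside a container whether a seccomp filter is active at all, the Seccomp field in /proc/self/status can be read. A minimal sketch, where seccomp_mode is a hypothetical helper name:

```c
#include <stdio.h>

/* Hypothetical helper: return the "Seccomp:" field of /proc/self/status
 * (0 = disabled, 1 = strict mode, 2 = filter mode), or -1 if the field
 * cannot be read. */
int
seccomp_mode(void)
{
    FILE *fp = fopen("/proc/self/status", "r");
    if (fp == NULL)
        return -1;

    char line[256];
    int mode = -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        /* Matches "Seccomp:<tab>N" but not "Seccomp_filters:" */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }

    fclose(fp);
    return mode;
}
```

With a seccomp profile applied this would be expected to return 2, and with --security-opt seccomp=unconfined it would be expected to return 0, though that has not been verified in this thread.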

Answer 11 · 2021-04-25T22:26:24.000Z

I use podman, and their default profile also whitelists close_range.

That is odd.
x11docker also sets --security-opt label=type:container_runtime_t to allow access to shared X unix sockets. I wonder if this somehow interferes and disallows close_range. I doubt it, however, I barely understand SELinux at all.

I think the next course of action is to run the container without x11docker. This should once and for all remove all blame from x11docker. Once I get the chance, I will do that and report my findings here.

I doubt it too. In the bug report I created at containers/podman#10337, I was able to reproduce the issue with a simple image built with buildah (walk.c is the same C code I pasted above):

buildah bud --no-cache --platform linux/amd64 -f - /tmp <<'EOF'
FROM alpine:edge
RUN apk update && apk add --upgrade build-base libc-dev linux-headers
COPY walk.c /app/walk.c
RUN gcc -o /app/walk /app/walk.c
ENTRYPOINT ["/app/walk"]
EOF

Running the above container gives the same Permission denied error.

Answer 12 · 2021-06-27T07:18:49.000Z

This seems to be fixed in podman now: containers/podman#10337 (comment)

Docker seems to be affected too and is not fixed yet: docker/for-linux#1262

Answer 13 · 2021-07-13T21:01:22.000Z

This happened to me while trying to run Ubuntu MATE 21.10 inside x11docker 6.6.2 with my usual command x11docker --desktop ubuntu-mate:impish --sudouser --dbus --clipboard --pulseaudio --xtest --init=systemd, which worked normally for 21.04.
Now it can't start MATE Terminal with bash inside it.

So I have reported bug 1935995 to launchpad.

Currently I'm running x11docker version 6.9.0 and Docker CE version 20.10.7~3-0~ubuntu-bionic. I can't understand how to change my startup options to include -- --security-opt seccomp=unconfined. Could anyone please help me?

Answer 14 · 2021-07-13T21:07:59.000Z

I can't understand how to change my startup options to include -- --security-opt seccomp=unconfined

Put the image name at the end of the command.
Add -- --security-opt seccomp=unconfined -- before the image name.

x11docker --desktop --sudouser --dbus --clipboard --pulseaudio --xtest --init=systemd -- --security-opt seccomp=unconfined -- ubuntu-mate:impish

This follows the syntax for custom RUN_OPTIONS:

  x11docker [OPTIONS] -- RUN_OPTIONS -- IMAGE [COMMAND [ARG1 ARG2 ...]]

Answer 15 · 2021-07-13T21:16:46.000Z

Thanks! Now it works. I will document it somewhere.

Answer 16 · 2021-11-22T08:13:32.000Z

Found elsewhere:

updating "crun" from 0.10 to 0.19 fixed the issue.

If the issue still occurs with (default) runc, it might help to switch to crun.

Answer 17 · 2022-04-02T20:45:59.000Z

I assume this issue has meanwhile been fixed in most distributions.

Answer 18 · 2023-05-19T05:05:56.000Z

@mviereck I just ran into this on Debian 12, maybe they’ve updated the vte3 dependency since Debian 11?