google/nsjail

nsjail usage of execveat failing under Debian

FrancisRussell opened this issue · 0 comments

I've been trying to use nsjail in STANDALONE_EXECVE mode along with --execute_fd, since the binary to be executed doesn't exist in the chroot. Unfortunately, the same set of steps works on some systems but not others. So far, the only thing the failing systems have in common is that they run Debian. The following script attempts to reproduce the issue (requires nsjail, tar, xz, curl, which and mktemp):

#!/usr/bin/env bash
set -eu

CHROOT="/tmp/nsjail-ubuntu-focal"
if ! [ -d "${CHROOT}" ]; then
  UNPACK_DIR=$(mktemp -p /tmp -d)
  trap "rm -rf \"${UNPACK_DIR}\"" EXIT HUP INT TERM
  curl -s https://cloud-images.ubuntu.com/minimal/releases/focal/release-20210720/ubuntu-20.04-minimal-cloudimg-amd64-root.tar.xz | tar -xJ --exclude="dev/*" -C "${UNPACK_DIR}"
  mv "${UNPACK_DIR}" "${CHROOT}"
fi
echo "Environment: "
env
echo -n "System: "
uname -a
echo -n "nsjail location: "
which nsjail
nsjail -Me -c "${CHROOT}" --execute_fd -- "${CHROOT}/bin/echo" "Hello world!"

In the working case, this prints Hello world!. On failing systems, the output is as follows (I've removed the environment variable dump). This run is on an x86_64 system running Debian testing with a 5.10.46 kernel.

System: Linux kvasir 5.10.0-8-amd64 #1 SMP Debian 5.10.46-4 (2021-08-03) x86_64 GNU/Linux
nsjail location: /nix/store/1pxcc157f0brr4hp0x4nvqav8fd3r416-nsjail-3.0/bin/nsjail
[I][2021-08-26T17:37:05+0100] Mode: STANDALONE_EXECVE
[I][2021-08-26T17:37:05+0100] Jail parameters: hostname:'NSJAIL', chroot:'/tmp/nsjail-ubuntu-focal', process:'/tmp/nsjail-ubuntu-focal/bin/echo', bind:[::]:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:true, clone_newuser:true, clone_newns:true, clone_newpid:true, clone_newipc:true, clone_newuts:true, clone_newcgroup:true, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2021-08-26T17:37:05+0100] Mount: '/tmp/nsjail-ubuntu-focal' -> '/' flags:MS_RDONLY|MS_BIND|MS_REC|MS_PRIVATE type:'' options:'' dir:true
[I][2021-08-26T17:37:05+0100] Mount: '/proc' flags:MS_RDONLY type:'proc' options:'' dir:true
[I][2021-08-26T17:37:05+0100] Uid map: inside_uid:1000 outside_uid:1000 count:1 newuidmap:false
[I][2021-08-26T17:37:05+0100] Gid map: inside_gid:1000 outside_gid:1000 count:1 newgidmap:false
[I][2021-08-26T17:37:05+0100] Executing '/tmp/nsjail-ubuntu-focal/bin/echo' for '[STANDALONE MODE]'
[E][2021-08-26T17:37:05+0100][74145] void subproc::subprocNewProc(nsjconf_t*, int, int, int, int, int)():205 execve('/tmp/nsjail-ubuntu-focal/bin/echo') failed: No such file or directory
[F][2021-08-26T17:37:05+0100][74145] bool subproc::runChild(nsjconf_t*, int, int, int, int)():429 Launching new process failed

Meanwhile, I've seen successful runs on Ubuntu, Arch and WSL2 Ubuntu systems. Things also execute correctly on a failing system if I pass --disable_clone_newns to nsjail. My own attempt to add debugging information to nsjail suggests that the file descriptor being passed to execveat is valid. The issue also occurs if the binary is statically linked (I tried using a statically linked busybox as the process to execute).
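The comparison with and without --disable_clone_newns can be scripted so the same check is easy to repeat on each system. This is only a sketch: try_variant is a hypothetical helper name, the NSJAIL variable is an assumption for pointing at a specific binary, and the nsjail flags are the ones already used in the script above.

```shell
#!/usr/bin/env bash
# Sketch: run the same nsjail invocation with and without a new mount namespace.
# CHROOT matches the reproduction script; NSJAIL is an assumed override variable.
CHROOT="${CHROOT:-/tmp/nsjail-ubuntu-focal}"
NSJAIL="${NSJAIL:-nsjail}"

# try_variant LABEL CMD [ARGS...]
# Runs CMD quietly and reports whether it succeeded (hypothetical helper).
try_variant() {
  local label="$1"
  shift
  local status=0
  "$@" >/dev/null 2>&1 || status=$?
  if [ "${status}" -eq 0 ]; then
    echo "${label}: ok"
  else
    echo "${label}: failed (exit ${status})"
  fi
}

# Only attempt the runs when nsjail is actually installed.
if command -v "${NSJAIL}" >/dev/null 2>&1; then
  try_variant "new mount namespace" \
    "${NSJAIL}" -Me -c "${CHROOT}" --execute_fd -- "${CHROOT}/bin/echo" "Hello world!"
  try_variant "--disable_clone_newns" \
    "${NSJAIL}" -Me -c "${CHROOT}" --execute_fd --disable_clone_newns -- "${CHROOT}/bin/echo" "Hello world!"
fi
```

On an affected system I would expect the first line to report a failure and the second to report ok, matching the behaviour described above.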

I'm aware that in the early days, Debian added some patches that meant certain namespacing functionality could only be enabled by a kernel option (https://lwn.net/Articles/673597/). However, I do not believe that any such "hardening" exists now. I'm somewhat at a loss, and am unsure whether this behaviour is due to the kernel or to some other environmental factor.
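For anyone wanting to rule out namespace restrictions on their own system, the relevant knobs can be read from /proc/sys. This is a sketch under two assumptions: that the Debian option in question is the kernel.unprivileged_userns_clone sysctl, and that the mainline user.max_user_namespaces limit is also worth checking. show_sysctl is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print namespace-related sysctls, or note that they are absent.

# show_sysctl PATH
# Prints "PATH = value" if the file is readable, otherwise "PATH: not present".
show_sysctl() {
  local path="$1"
  if [ -r "${path}" ]; then
    printf '%s = %s\n' "${path}" "$(cat "${path}")"
  else
    printf '%s: not present\n' "${path}"
  fi
}

# Assumed to be the Debian-specific switch; absent on kernels without the patch.
show_sysctl /proc/sys/kernel/unprivileged_userns_clone
# Mainline per-user limit on user namespaces; a value of 0 disables them.
show_sysctl /proc/sys/user/max_user_namespaces
```

If the first file is absent or contains 1 and the second is non-zero, unprivileged user namespaces should not be what is blocking nsjail.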