Many container runtime tools like systemd-nspawn, docker, etc. focus on providing infrastructure for system administrators and orchestration tools (e.g. Kubernetes) to run containers.
These tools are not suitable to give to unprivileged users, because it is trivial to turn such access into a fully privileged root shell on the host.
There is an effort in the Linux kernel called user namespaces which attempts to allow unprivileged users to use container features. While significant progress has been made, there are still concerns about it, and it is not available to unprivileged users in several production distributions such as CentOS/Red Hat Enterprise Linux 7, Debian Jessie, etc.
See for example CVE-2016-3135 which is a local root vulnerability introduced by userns. This March 2016 post has some more discussion.
Bubblewrap could be viewed as a setuid implementation of a subset of user namespaces. Emphasis on subset: specifically relevant to the above CVE, bubblewrap does not allow control over iptables.
The original bubblewrap code existed before user namespaces - it inherits code from xdg-app helper which in turn distantly derives from linux-user-chroot.
The maintainers of this tool believe that it does not, even when used in combination with typical software installed on that distribution, allow privilege escalation. It may, however, increase the ability of a logged-in user to perform denial of service attacks.
In particular, bubblewrap uses PR_SET_NO_NEW_PRIVS to turn off setuid binaries, which is the traditional way to get out of things like chroots.
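One way to observe this flag is the NoNewPrivs field that Linux exposes in /proc/<pid>/status (kernels 4.10 and later). The following is a minimal sketch, not part of bubblewrap itself: the helper names `nnp_from_status` and `nnp_in_sandbox` are hypothetical, and the second helper assumes `bwrap` is installed and that the host /proc is reachable through the recursive read-only bind of /.

```shell
#!/bin/sh
# Hypothetical helper: print the NoNewPrivs value found in a
# /proc/<pid>/status-style file given as $1.
nnp_from_status() {
    awk '/^NoNewPrivs:/ { print $2 }' "$1"
}

# Hypothetical helper: report the flag as seen by a process that bwrap
# started. Assumes bwrap is installed; the host /proc is visible inside
# the sandbox because --ro-bind / / binds the whole tree.
nnp_in_sandbox() {
    bwrap --ro-bind / / awk '/^NoNewPrivs:/ { print $2 }' /proc/self/status
}
```

If the flag is set as described, `nnp_in_sandbox` should print 1, whereas `nnp_from_status /proc/self/status` in an ordinary login shell typically prints 0.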
This program can be shared by all container tools which perform non-root operation, such as:
We would also like to see this be available in Kubernetes/OpenShift clusters. Having the ability for unprivileged users to use container features would make it significantly easier to do interactive debugging scenarios and the like.
bubblewrap works by creating a new, completely empty, mount namespace where the root is on a tmpfs that is invisible from the host, and will be automatically cleaned up when the last process exits. You can then use commandline options to construct the root filesystem and process environment and command to run in the namespace.
A simple example is
bwrap --ro-bind / / bash
This will create a read-only bind mount of the host root at the sandbox root, and then start bash.
Another simple example would be a read-write chroot operation:
bwrap --bind /some/chroot/dir / bash
A more complex example is to run a shell with a custom (read-only) /usr, but your own (tmpfs) data, running in a PID and network namespace:
bwrap --ro-bind /usr /usr \
--tmpfs /tmp \
--proc /proc \
--dev /dev \
--ro-bind /etc/resolv.conf /etc/resolv.conf \
--symlink usr/lib /lib \
--symlink usr/lib64 /lib64 \
--symlink usr/bin /bin \
--symlink usr/sbin /sbin \
--chdir / \
--unshare-pid \
--unshare-net \
--dir "/run/user/$(id -u)" \
--setenv XDG_RUNTIME_DIR "/run/user/$(id -u)" \
/bin/sh
The goal of bubblewrap is to run an application in a sandbox, where it has restricted access to parts of the operating system or user data such as the home directory.
bubblewrap always creates a new mount namespace, and the user can specify exactly what parts of the filesystem should be visible in the sandbox. Any such directories you specify are mounted nodev by default, and can be made read-only.
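As an illustration of choosing what is visible, the sketch below exposes the host filesystem read-only, replaces the home directory with an empty tmpfs, and binds only the current directory back in read-write. The wrapper name `run_without_home` is hypothetical, and the sketch assumes `bwrap` is installed and that the current directory is not the home directory itself (otherwise the final bind would make the home directory visible again).

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with the host visible read-only,
# the home directory hidden behind an empty tmpfs, and only the current
# directory writable.
run_without_home() {
    bwrap \
        --ro-bind / / \
        --tmpfs "$HOME" \
        --bind "$PWD" "$PWD" \
        --chdir "$PWD" \
        "$@"
}
```

For instance, `run_without_home ls -la "$HOME"` would be one way to check what the sandboxed command can see of the home directory.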
Additionally you can use these kernel features:
User namespaces (CLONE_NEWUSER): This hides all but the current uid and gid from the sandbox. You can also change what the value of uid/gid should be in the sandbox.
IPC namespaces (CLONE_NEWIPC): The sandbox will get its own copy of all the different forms of IPC, like SysV shared memory and semaphores.
PID namespaces (CLONE_NEWPID): The sandbox will not see any processes outside the sandbox. Additionally, bubblewrap will run a trivial pid1 inside your container to handle the requirements of reaping children in the sandbox. This avoids what is now known as the Docker pid 1 problem.
Network namespaces (CLONE_NEWNET): The sandbox will not see the network. Instead it will have its own network namespace with only a loopback device.
UTS namespace (CLONE_NEWUTS): The sandbox will have its own hostname.
Seccomp filters: You can pass in seccomp filters that limit which syscalls can be done in the sandbox. For more information, see Seccomp.
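The namespace options in the list above can be combined in a single invocation. This is a minimal sketch rather than a recommended configuration: the wrapper name `run_isolated`, the uid/gid value 1000 and the hostname `sandbox` are illustrative choices, and the options `--unshare-user`, `--unshare-ipc`, `--unshare-uts`, `--uid`, `--gid` and `--hostname` come from the bwrap manual page rather than from the examples in this document, so it is worth confirming them against `bwrap --help` for your version. Seccomp filters are left out because they need a compiled filter program.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with the user, IPC, PID, network
# and UTS namespaces all unshared. Assumes bwrap is installed and that
# the kernel permits these namespaces for the invoking user.
run_isolated() {
    bwrap \
        --ro-bind / / \
        --proc /proc \
        --dev /dev \
        --unshare-user --uid 1000 --gid 1000 \
        --unshare-ipc \
        --unshare-pid \
        --unshare-net \
        --unshare-uts --hostname sandbox \
        "$@"
}
```

A call such as `run_isolated sh -c 'id -u; hostname; ls /proc'` is one way to inspect the uid, hostname and visible processes from inside the sandbox.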
Firejail is similar to xdg-app before bubblewrap was split out in that it combines a setuid tool with a lot of desktop-specific sandboxing features. For example, Firejail knows about Pulseaudio, whereas bubblewrap does not.
The bubblewrap authors believe it's much easier to audit a small setuid program, and keep features such as Pulseaudio filtering as an unprivileged process, as now occurs in xdg-app.
Also, @cgwalters thinks trying to whitelist file paths is a bad idea given the myriad ways users have to manipulate paths, and the myriad ways in which system administrators may configure a system. The bubblewrap approach is to only retain a few specific Linux capabilities such as CAP_SYS_ADMIN, but to always access the filesystem as the invoking uid. This entirely closes TOCTTOU attacks and such.
Sandstorm.io also has a setuid helper process. @cgwalters believes their setuid code is fairly good, but it could still make sense to unify on bubblewrap as a setuid core. That hasn't been ruled out, but neither is it being actively pursued today.
runc is similar to systemd-nspawn in that it is tooling intended to be invoked by root. There is an effort to have runc optionally use user namespaces, but no plans for any setuid support.
The bubblewrap authors believe that runc and systemd-nspawn are not designed to be made setuid and are distant from supporting such a mode.
binctr is just a wrapper for runc, so inherits all of its design tradeoffs.
The name bubblewrap was chosen to convey that this tool runs as the parent of the application (so wraps it in some sense) and creates a protective layer (the sandbox) around it.
(Bubblewrap cat by dancing_stupidity)