/rootlesskit

Linux-native "fake root" for rootless containers

Primary LanguageGoApache License 2.0Apache-2.0

RootlessKit: Linux-native fakeroot using user namespaces

RootlessKit is a Linux-native implementation of "fake root" using user_namespaces(7). The purpose of RootlessKit is to run Docker and Kubernetes as an unprivileged user (known as "Rootless mode"), so as to protect the real root on the host from potential container-breakout attacks.

What RootlessKit actually does

RootlessKit creates user_namespaces(7) and mount_namespaces(7), and executes newuidmap(1)/newgidmap(1) along with subuid(5) and subgid(5).

RootlessKit also supports isolating network_namespaces(7) with userspace NAT using "slirp". Kernel-mode NAT using SUID-enabled lxc-user-nic(1) is also experimentally supported.

Similar projects

Tools based on LD_PRELOAD (not enough to run rootless containers and yet lacks support for static binaries):

Tools based on ptrace(2) (not enough to run rootless containers and yet slow):

Tools based on user_namespaces(7) (as in RootlessKit, but without support for --copy-up, --net, ...):

Projects using RootlessKit

Container engines:

Container image builders:

  • BuildKit: Next-generation docker build backend

Kubernetes distributions:

  • Usernetes: Docker & Kubernetes, installable under a non-root user's $HOME.
  • k3s: Lightweight Kubernetes

Setup

$ go get github.com/rootless-containers/rootlesskit/cmd/rootlesskit
$ go get github.com/rootless-containers/rootlesskit/cmd/rootlessctl

or just run make to make binaries under ./bin directory.

Requirements

  • newuidmap and newgidmap need to be installed on the host. These commands are provided by the uidmap package on most distributions.

  • /etc/subuid and /etc/subgid should contain more than 65536 sub-IDs. e.g. penguin:231072:65536. These files are automatically configured on most distributions.

$ id -u
1001
$ whoami
penguin
$ grep "^$(whoami):" /etc/subuid
penguin:231072:65536
$ grep "^$(whoami):" /etc/subgid
penguin:231072:65536

Distribution-specific hints

Debian (excluding Ubuntu):

  • sudo sh -c "echo 1 > /proc/sys/kernel/unprivileged_userns_clone" is required

Arch Linux:

  • sudo sh -c "echo 1 > /proc/sys/kernel/unprivileged_userns_clone" is required

RHEL/CentOS 7 (excluding RHEL/CentOS 8):

  • sudo sh -c "echo 28633 > /proc/sys/user/max_user_namespaces" is required

To persist sysctl configurations, edit /etc/sysctl.conf or add a file under /etc/sysctl.d.

Usage

Inside rootlesskit, your UID is mapped to 0 but it is not the real root:

$ rootlesskit bash
rootlesskit$ id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
rootlesskit$ ls -l /etc/shadow
-rw-r----- 1 nobody nogroup 1050 Aug 21 19:02 /etc/shadow
rootlesskit$ $ cat /etc/shadow
cat: /etc/shadow: Permission denied

Environment variables are kept untouched:

$ rootlesskit bash
rootlesskit$ echo $USER
penguin
rootlesskit$ echo $HOME
/home/penguin
rootlesskit$ echo $XDG_RUNTIME_DIR
/run/user/1001

Filesystems can be isolated from the host with --copy-up:

$ rootlesskit --copy-up=/etc bash
rootlesskit$ rm /etc/resolv.conf
rootlesskit$ vi /etc/resolv.conf

You can even create network namespaces with Slirp:

$ rootlesskit --copy-up=/etc --copy-up=/run --net=slirp4netns --disable-host-loopback bash
rootlesskit$ ip netns add foo
...

Proc filesystem view:

$ rootlesskit bash
rootlesskit$ cat /proc/self/uid_map
         0       1001          1
         1     231072      65536
rootlesskit$ cat /proc/self/gid_map
         0       1001          1
         1     231072      65536
rootlesskit$ cat /proc/self/setgroups
allow

Full CLI options

NAME:
   rootlesskit - Linux-native fakeroot using user namespaces

USAGE:
   rootlesskit [global options] [arguments...]

VERSION:
   0.10.0

DESCRIPTION:
   RootlessKit is a Linux-native implementation of "fake root" using user_namespaces(7).

   Web site: https://github.com/rootless-containers/rootlesskit

   Examples:
     # spawn a shell with a new user namespace and a mount namespace
     rootlesskit bash

     # make /etc writable
     rootlesskit --copy-up=/etc bash

     # set mount propagation to rslave
     rootlesskit --propagation=rslave bash

     # create a network namespace with slirp4netns, and expose 80/tcp on the namespace as 8080/tcp on the host
     rootlesskit --copy-up=/etc --net=slirp4netns --disable-host-loopback --port-driver=builtin -p 127.0.0.1:8080:80/tcp bash

   Note: RootlessKit requires /etc/subuid and /etc/subgid to be configured by the real root user.

GLOBAL OPTIONS:
   --debug                      debug mode (default: false)
   --state-dir value            state directory
   --net value                  network driver [host, slirp4netns, vpnkit, lxc-user-nic(experimental)] (default: "host")
   --slirp4netns-binary value   path of slirp4netns binary for --net=slirp4netns (default: "slirp4netns")
   --slirp4netns-sandbox value  enable slirp4netns sandbox (experimental) [auto, true, false] (the default is planned to be "auto" in future) (default: "false")
   --slirp4netns-seccomp value  enable slirp4netns seccomp (experimental) [auto, true, false] (the default is planned to be "auto" in future) (default: "false")
   --vpnkit-binary value        path of VPNKit binary for --net=vpnkit (default: "vpnkit")
   --lxc-user-nic-binary value  path of lxc-user-nic binary for --net=lxc-user-nic (default: "/usr/lib/x86_64-linux-gnu/lxc/lxc-user-nic")
   --lxc-user-nic-bridge value  lxc-user-nic bridge name (default: "lxcbr0")
   --mtu value                  MTU for non-host network (default: 65520 for slirp4netns, 1500 for others) (default: 0)
   --cidr value                 CIDR for slirp4netns network (default: 10.0.2.0/24)
   --disable-host-loopback      prohibit connecting to 127.0.0.1:* on the host namespace (default: false)
   --copy-up value              mount a filesystem and copy-up the contents. e.g. "--copy-up=/etc" (typically required for non-host network)
   --copy-up-mode value         copy-up mode [tmpfs+symlink] (default: "tmpfs+symlink")
   --port-driver value          port driver for non-host network. [none, builtin, slirp4netns, socat(deprecated)] (default: "none")
   --publish value, -p value    publish ports. e.g. "127.0.0.1:8080:80/tcp"
   --pidns                      create a PID namespace (default: false)
   --cgroupns                   create a cgroup namespace (default: false)
   --utsns                      create a UTS namespace (default: false)
   --ipcns                      create an IPC namespace (default: false)
   --propagation value          mount propagation [rprivate, rslave] (default: "rprivate")
   --help, -h                   show help (default: false)
   --version, -v                print the version (default: false)

State directory

The following files will be created in the state directory, which can be specified with --state-dir:

  • lock: lock file
  • child_pid: decimal PID text that can be used for nsenter(1).
  • api.sock: REST API socket for rootlessctl. See Port Drivers section.

If --state-dir is not specified, RootlessKit creates a temporary state directory on /tmp and removes it on exit.

Undocumented files are subject to change.

Environment variables

The following environment variables will be set for the child process:

  • ROOTLESSKIT_STATE_DIR (since v0.3.0): absolute path to the state dir
  • ROOTLESSKIT_PARENT_EUID (since v0.8.0): effective UID
  • ROOTLESSKIT_PARENT_EGID (since v0.8.0): effective GID

Undocumented environment variables are subject to change.

PID Namespace

When --pidns (since v0.5.0) is specified, RootlessKit executes the child process in a new PID namespace. The RootlessKit child process becomes the init (PID=1). When RootlessKit terminates, all the processes in the namespace are killed with SIGKILL.

See also pid_namespaces(7).

Mount Propagation

The mount namespace created by RootlessKit has rprivate propagation by default.

Starting with v0.9.0, the propagation can be set to rslave by specifying --propagation=rslave.

The propagation can be also set to rshared, but known not to work with --copy-up.

Note that rslave and rshared do not work as expected when the host root filesystem isn't mounted with "shared". (Use findmnt -n -l -o propagation / to inspect the current mount flag.)

Network Drivers

RootlessKit provides several drivers for providing network connectivity:

  • --net=host: use host network namespace (default)
  • --net=slirp4netns: use slirp4netns (recommended)
  • --net=vpnkit: use VPNKit
  • --net=lxc-user-nic: use lxc-user-nic (experimental)

Benchmark: iperf3 from the child to the parent (Mar 8, 2020):

Driver MTU=1500 MTU=65520
slirp4netns 1.06 Gbps 7.55 Gbps
slirp4netns (with sandbox + seccomp) 1.05 Gbps 7.21 Gbps
vpnkit 0.60 Gbps (Unsupported)
lxc-user-nic 31.4 Gbps 30.9 Gbps
(rootful veth) (38.7 Gbps) (40.8 Gbps)

--net=host (default)

--net=host does not isolate the network namespace from the host.

Pros:

  • No performance overhead
  • Supports ICMP Echo (ping) when /proc/sys/net/ipv4/ping_group_range is configured

Cons:

  • No permission for network-namespaced operations, e.g. creating iptables rules, running tcpdump

To route ICMP Echo packets (ping), you need to write the range of GIDs to net.ipv4.ping_group_range.

$ sudo sh -c "echo 0   2147483647  > /proc/sys/net/ipv4/ping_group_range"

--net=slirp4netns (recommended)

--net=slirp4netns isolates the network namespace from the host and launch slirp4netns for providing usermode networking.

Pros:

  • Possible to perform network-namespaced operations, e.g. creating iptables rules, running tcpdump
  • Supports ICMP Echo (ping) when /proc/sys/net/ipv4/ping_group_range is configured
  • Supports hardening using mount namespace and seccomp (--slirp4netns-sandbox=auto, --slirp4netns-seccomp=auto, since RootlessKit v0.7.0, slirp4netns v0.4.0)

Cons:

  • Extra performance overhead (but still faster than --net=vpnkit)
  • Supports only TCP, UDP, and ICMP Echo packets

To use --net=slirp4netns, you need to install slirp4netns v0.4.0 or later.

$ sudo dnf install slirp4netns

or

$ sudo apt-get install slirp4netns

If binary package is not available for your distribution, install from the source:

$ git clone https://github.com/rootless-containers/slirp4netns
$ cd slirp4netns
$ ./autogen.sh && ./configure && make
$ cp slirp4netns ~/bin

The network is configured as follows by default:

  • IP: 10.0.2.100/24
  • Gateway: 10.0.2.2
  • DNS: 10.0.2.3

The network configuration can be changed by specifying custom CIDR, e.g. --cidr=10.0.3.0/24 (requires slirp4netns v0.3.0+).

Specifying --copy-up=/etc is highly recommended unless /etc/resolv.conf on the host is statically configured. Otherwise /etc/resolv.conf in the RootlessKit's mount namespace will be unmounted when /etc/resolv.conf on the host is recreated, typically by NetworkManager or systemd-resolved.

It is also highly recommended to specyfy--disable-host-loopback. Otherwise ports listening on 127.0.0.1 in the host are accessible as 10.0.2.2 in the RootlessKit's network namespace.

Example session:

$ rootlesskit --net=slirp4netns --copy-up=/etc --disable-host-loopback bash
rootlesskit$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UP group default qlen 1000
    link/ether 46:dc:8d:09:fd:f2 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.100/24 scope global tap0
       valid_lft forever preferred_lft forever
    inet6 fe80::44dc:8dff:fe09:fdf2/64 scope link
       valid_lft forever preferred_lft forever
ootlesskit$ ip r
default via 10.0.2.2 dev tap0
10.0.2.0/24 dev tap0 proto kernel scope link src 10.0.2.100
rootlesskit$ cat /etc/resolv.conf 
nameserver 10.0.2.3
rootlesskit$ curl https://www.google.com
<!doctype html><html ...>...</html>

Starting with RootlessKit v0.7.0 + slirp4netns v0.4.0, --slirp4netns-sandbox=auto/true/false (enables mount namespace) and --slirp4netns-seccomp=auto/true/false (enables seccomp rules) can be used to harden the slirp4netns process.

--net=vpnkit

--net=vpnkit isolates the network namespace from the host and launch VPNKit for providing usermode networking.

Pros:

  • Possible to perform network-namespaced operations, e.g. creating iptables rules, running tcpdump

Cons:

  • Extra performance overhead
  • Supports only TCP and UDP packets. No support for ICMP Echo (ping) unlike --net=slirp4netns, even if /proc/sys/net/ipv4/ping_group_range is configured.

To use --net=vpnkit, you need to install VPNkit.

$ git clone https://github.com/moby/vpnkit.git
$ cd vpnkit
$ make
$ cp vpnkit.exe ~/bin/vpnkit

The network is configured as follows by default:

  • IP: 192.168.65.3/24
  • Gateway: 192.168.65.1
  • DNS: 192.168.65.1

As in --net=slirp4netns, specifying --copy-up=/etc and --disable-host-loopback is highly recommended. If --disable-host-loopback is not specified, ports listening on 127.0.0.1 in the host are accessible as 192.168.65.2 in the RootlessKit's network namespace.

--net=lxc-user-nic (experimental)

--net=lxc-user-nic isolates the network namespace from the host and launch lxc-user-nic(1) SUID binary for providing kernel-mode NAT.

Pros:

  • The least performance overhead
  • Possible to perform network-namespaced operations, e.g. creating iptables rules, running tcpdump
  • Supports ICMP Echo (ping) without /proc/sys/net/ipv4/ping_group_range configuration

Cons:

  • Less secure
  • Needs /etc/lxc/lxc-usernet configuration

To use lxc-user-nic, you need to install liblxc-common package:

$ sudo apt-get install liblxc-common

You also need to set up /etc/lxc/lxc-usernet:

# USERNAME TYPE BRIDGE COUNT
penguin    veth lxcbr0 1

The COUNT value needs to be increased to run multiple RootlessKit instances with --net=lxc-user-nic simultaneously.

It may take a few seconds to configure the interface using DHCP.

If you start and stop RootlessKit too frequently, you might use up all available DHCP addresses. You might need to reset /var/lib/misc/dnsmasq.lxcbr0.leases and restart the lxc-net service.

Currently, the MAC address is always set to a random address.

Port Drivers

To the ports in the network namespace to the host network namespace, --port-driver needs to be specified.

The default value is none (do not expose ports).

--port-driver Throughput Source IP
slirp4netns 6.89 Gbps Propagated
socat (Deprecated) 7.80 Gbps Always 127.0.0.1
builtin 30.0 Gbps Always 127.0.0.1

(Benchmark: iperf3 from the parent to the child (Mar 8, 2020))

The builtin driver is fastest, but be aware that the source IP is not propagated and always set to 127.0.0.1.

Exposing ports

For example, to expose 80 in the child as 8080 in the parent:

$ rootlesskit --state-dir=/run/user/1001/rootlesskit/foo --net=slirp4netns --disable-host-loopback --copy-up=/etc --port-driver=builtin bash
rootlesskit$ rootlessctl --socket=/run/user/1001/rootlesskit/foo/api.sock add-ports 0.0.0.0:8080:80/tcp
1
rootlesskit$ rootlessctl --socket=/run/user/1001/rootlesskit/foo/api.sock list-ports
ID    PROTO    PARENTIP   PARENTPORT    CHILDPORT    
1     tcp      0.0.0.0    8080          80
rootlesskit$ rootlessctl --socket=/run/user/1001/rootlesskit/foo/api.sock remove-ports 1
1

You can also expose ports using socat and nsenter instead of RootlessKit's port drivers.

$ pid=$(cat /run/user/1001/rootlesskit/foo/child_pid)
$ socat -t -- TCP-LISTEN:8080,reuseaddr,fork EXEC:"nsenter -U -n -t $pid socat -t -- STDIN TCP4\:127.0.0.1\:80"

Exposing privileged ports

To expose privileged ports (< 1024), add net.ipv4.ip_unprivileged_port_start=0 to /etc/sysctl.conf (or /etc/sysctl.d) and run sudo sysctl --system.

If you are using builtin driver, you can expose the privileged ports without changing the sysctl value, but you need to set CAP_NET_BIND_SERVICE on rootlesskit binary.

$ sudo setcap cap_net_bind_service=ep $(pwd rootlesskit)