rootless-containers/rootlesskit

Support new pid namespace or ability to kill all children

ibuildthecloud opened this issue · 12 comments

This is a follow up to our discussion at DockerCon.

Rootlesskit (and the usernetes work) has been integrated it k3s but results in several usability issues. For now it seems the simplest approach to making rootless with k3s more user friendly would be if rootkit could ensure some way that if the children of executed process died when the process died. This could be done by creating a new pid namespace or possibly with cgroup (although I don't know if rootless can modify cgroups).

I'm just curious, why Pdeathsig doesn't work?

Minimal reproducer for the issue

// Package main provides a reproducer for the PdeathSig issue.
//
// ```
// Terminal1$ ./pdeathsig-demo tail -f /dev/null
//
// Terminal2$ ps u
// ...
// suda       5705  0.0  0.0 102776  1844 pts/1    Sl+  02:47   0:00 ./pdeathsig-demo tail -f /dev/null
// suda       5709  0.0  0.0 102776  1520 pts/1    Sl+  02:47   0:00 /proc/self/exe tail -f /dev/null
// suda       5714  0.0  0.0   8068   820 pts/1    S+   02:47   0:00 tail -f /dev/null
// ...
// Terminal2$ kill 5705
// Terminal2$ ps u
// ...
// suda       5714  0.0  0.0   8068   820 pts/1    S    02:47   0:00 tail -f /dev/null
// ...
// ```
//
// The `tail -f /dev/null` process is expected to die, but actually it doesn't, due to some reason
// related to the newuidmap binary (SUID).
//
// When SKIP_NEWUIDMAP=1 is set, the `tail -f /dev/null` process dies as expected.
package main

import (
        "os"
        "os/exec"
        "strconv"
        "syscall"
)

func main() {
        if err := xmain(); err != nil {
                panic(err)
        }
}

func xmain() error {
        if os.Getenv("child") != "" {
                return child()
        }
        return parent()
}

func parent() error {
        cmd := exec.Command("/proc/self/exe", os.Args[1:]...)
        cmd.Env = append(os.Environ(), "child=1")
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        cmd.SysProcAttr = &syscall.SysProcAttr{
                Pdeathsig:  syscall.SIGKILL,
                Cloneflags: syscall.CLONE_NEWUSER,
        }

        if err := cmd.Start(); err != nil {
                return err
        }

        if os.Getenv("SKIP_NEWUIDMAP") != "1" {
                if err := setupUIDMap(cmd.Process.Pid); err != nil {
                        return err
                }
        }
        return cmd.Wait()
}

func child() error {
        cmd := exec.Command(os.Args[1], os.Args[2:]...)
        cmd.Stdin = os.Stdin
        cmd.Stdout = os.Stdout
        cmd.Stderr = os.Stderr
        cmd.SysProcAttr = &syscall.SysProcAttr{
                Pdeathsig: syscall.SIGKILL,
        }
        return cmd.Run()
}

func setupUIDMap(pid int) error {
        // newuidmap typically has SUID bit
        cmd := exec.Command("newuidmap", strconv.Itoa(pid), "0", strconv.Itoa(os.Geteuid()), "1")
        return cmd.Run()
}

@cyphar @giuseppe
Is this by design or bug (of Go? kernel?) ?
Is there any workaround?

PR for the Pdeathsig issue: #66

@ibuildthecloud
Does it work for you?

@cyphar @giuseppe
Is this by design or bug (of Go? kernel?) ?
Is there any workaround?

it seems the kernel resets the flag when the credentials for the process are changed. The prctl(PR_SET_PDEATHSIG) should be done once the user namespace is configured and the setresuid completed. It is trivial in C, but I am not sure how to achieve it in Go, maybe re-execing the process once in the namespace?

#66 seems working, though not sure correct

Released v0.4.1.

I'm closing this issue but let me know if there is still problem.

reopening, as we need a way to kill grandchildren under the rootlesskit child process...

PR for creating PIDNS: #67

With the merging of #67 for the --pidns flag, does that mean that this section of the docs is no longer valid?

Once you kill K3s and then start a new instance of K3s it will create a new network namespace, but it doesn’t kill the old pods. So you are left with a fairly broken setup.

@magikid I'm wondering the same too, since they link to this issue.

The usage of Pdeathsig in #66 isn't quite right. It will guarantee that the child dies, but it won't ensure that the parent doesn't accidentally kill the child. See golang/go#27505 for details.

The correct way to use Pdeathsig would be to launch a goroutine, lock it to the OS thread, start the child in that goroutine, and don't exit the goroutine until the child exits. There's a more complete example in the issue I linked above.

FWIW, if you're launching lots of children, it might make sense to use a single goroutine locked to an OS thread to launch all of the potential children.

Thanks, opened #182 for tracking