Support new pid namespace or ability to kill all children
ibuildthecloud opened this issue · 12 comments
This is a follow up to our discussion at DockerCon.
Rootlesskit (and the usernetes work) has been integrated it k3s but results in several usability issues. For now it seems the simplest approach to making rootless with k3s more user friendly would be if rootkit could ensure some way that if the children of executed process died when the process died. This could be done by creating a new pid namespace or possibly with cgroup (although I don't know if rootless can modify cgroups).
I'm just curious, why Pdeathsig doesn't work?
Minimal reproducer for the issue
// Package main provides a reproducer for the PdeathSig issue.
//
// ```
// Terminal1$ ./pdeathsig-demo tail -f /dev/null
//
// Terminal2$ ps u
// ...
// suda 5705 0.0 0.0 102776 1844 pts/1 Sl+ 02:47 0:00 ./pdeathsig-demo tail -f /dev/null
// suda 5709 0.0 0.0 102776 1520 pts/1 Sl+ 02:47 0:00 /proc/self/exe tail -f /dev/null
// suda 5714 0.0 0.0 8068 820 pts/1 S+ 02:47 0:00 tail -f /dev/null
// ...
// Terminal2$ kill 5705
// Terminal2$ ps u
// ...
// suda 5714 0.0 0.0 8068 820 pts/1 S 02:47 0:00 tail -f /dev/null
// ...
// ```
//
// The `tail -f /dev/null` process is expected to die, but actually it doesn't, due to some reason
// related to the newuidmap binary (SUID).
//
// When SKIP_NEWUIDMAP=1 is set, the `tail -f /dev/null` process dies as expected.
package main
import (
"os"
"os/exec"
"strconv"
"syscall"
)
func main() {
if err := xmain(); err != nil {
panic(err)
}
}
func xmain() error {
if os.Getenv("child") != "" {
return child()
}
return parent()
}
func parent() error {
cmd := exec.Command("/proc/self/exe", os.Args[1:]...)
cmd.Env = append(os.Environ(), "child=1")
cmd.Stdin = os.Stdin
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.SysProcAttr = &syscall.SysProcAttr{
Pdeathsig: syscall.SIGKILL,
Cloneflags: syscall.CLONE_NEWUSER,
}
if err := cmd.Start(); err != nil {
return err
}
if os.Getenv("SKIP_NEWUIDMAP") != "1" {
if err := setupUIDMap(cmd.Process.Pid); err != nil {
return err
}
}
return cmd.Wait()
}
func child() error {
cmd := exec.Command(os.Args[1], os.Args[2:]...)
cmd.Stdin = os.Stdin
cmd.Stdout = os.Stdout
cmd.Stderr = os.Stderr
cmd.SysProcAttr = &syscall.SysProcAttr{
Pdeathsig: syscall.SIGKILL,
}
return cmd.Run()
}
func setupUIDMap(pid int) error {
// newuidmap typically has SUID bit
cmd := exec.Command("newuidmap", strconv.Itoa(pid), "0", strconv.Itoa(os.Geteuid()), "1")
return cmd.Run()
}
@cyphar @giuseppe
Is this by design or bug (of Go? kernel?) ?
Is there any workaround?
PR for the Pdeathsig issue: #66
@ibuildthecloud
Does it work for you?
@cyphar @giuseppe
Is this by designor bug (of Go? kernel?) ?
Is there any workaround?
it seems the kernel resets the flag when the credentials for the process are changed. The prctl(PR_SET_PDEATHSIG) should be done once the user namespace is configured and the setresuid completed. It is trivial in C, but I am not sure how to achieve it in Go, maybe re-execing the process once in the namespace?
#66 seems working, though not sure correct
Released v0.4.1.
I'm closing this issue but let me know if there is still problem.
reopening, as we need a way to kill grandchildren under the rootlesskit child process...
PR for creating PIDNS: #67
With the merging of #67 for the --pidns
flag, does that mean that this section of the docs is no longer valid?
Once you kill K3s and then start a new instance of K3s it will create a new network namespace, but it doesn’t kill the old pods. So you are left with a fairly broken setup.
@magikid I'm wondering the same too, since they link to this issue.
The usage of Pdeathsig
in #66 isn't quite right. It will guarantee that the child dies, but it won't ensure that the parent doesn't accidentally kill the child. See golang/go#27505 for details.
The correct way to use Pdeathsig
would be to launch a goroutine, lock it to the OS thread, start the child in that goroutine, and don't exit the goroutine until the child exits. There's a more complete example in the issue I linked above.
FWIW, if you're launching lots of children, it might make sense to use a single goroutine locked to an OS thread to launch all of the potential children.
Thanks, opened #182 for tracking