containers/virtcontainers

Shims related to the same container should be spawned into the same PID namespace

Closed this issue · 6 comments

I was playing with docker recently, starting a container from a first shell:

docker run --runtime=runc --rm -it busybox

and exec'ing an extra process with the following command:

docker exec -it $CONTAINER_ID sh

when I realized that sometimes (I would say 50% of the time), the exec'ed process was not returning after an exit from the container process, and the shell was staying completely stuck until I restart my docker service.

I have investigated by looking at the difference between the logs from a working run vs a not working run, and I have found that when the shim corresponding to the container process was exiting before the shim related to the exec'ed process (basically a child process), we were having the wrong behavior described above.

The rationale behind the issue seems logical to me, because docker expects those processes to run inside a PID namespace, it does expect that the container process should be the last one to exit. Indeed, the container process being the "init" process of the PID namespace, when it terminates for any reason, the kernel SIGKILL all the other processes inside the PID namespace, and the "init" process returns.

Now, why are we getting issues in our case. The thing is that when our container process inside the VM terminates, the guest kernel will kill the processes inside the same PID namespace and they will all return 137 as exit code. But we are still inside the VM here, and we have to pass this information back to the shim (through a proxy in some cases), which means we have no control which shim (corresponding to every process) will exit first.

I have confirmed those investigations by getting this test working all the time by adding a 5s sleep() here in case the process was the container process. This way, I have forced the container process to wait 5s before to return its exit code back to the shim, leaving some times for other shims to return before the container process.

My suggestion for this issue is to solve this in 2 steps:

  • The agent has to ensure every child process started with exec should be waited by the container process so that no process is left behind after the container process returns its exit code.
  • Virtcontainers should start every shim related to the same container into the same PID namespace. This way, when the container process terminates, the host kernel can kill our shim processes related to exec'ed processes, all returning 137 exit code, or we can get the same exit code from the agent (in case the shim received the exit code before to be killed).

@sboeuf -- Good catch. The solutions you propose seem pragmatic to me. Have you opened an issue on the kata agent for this yet? And, should we at least open an issue for cc-agent as well?

@egernst if we agree on this approach, this means we need to open an issue on both Kata and Clear agents, and to solve this one for the virtcontainers side of things.

I have started working on this and I have realized that we might not expect virtcontainers to set every shim into the expected namespaces.
Indeed, it's not really clean to switch virtcontainers current process into the namespace and then spawn the shim from there. Think about all the namespaces (UTS, PID, ...), if we want to enter all of them, this will end up with russian dolls code, where we would enter a first namespace who would enter the second one, who would enter a third one, ..., until we spawn the shim. Such a recursive code is very likely to be buggy according to the complexity of namespaces with Go code. Instead, we should consider implementing this into the shim itself, like @bergwolf started to do this for network ns in Kata shim: kata-containers/shim@16cb3dc
The main drawback using @bergwolf approach here is that we cannot make this work for qemu for instance, as we won't modify the binary itself (unless we write a wrapper binary around the qemu binary). And also this means that every shim/proxy implementation will have to be implemented to support such a thing if we end up trying to put all those processes into specific namespaces.
@sameo WDYT ? I could make the recursive case working if we think this is an OK approach.

This code I've just written based on CNI package ns could do the trick !