openSUSE/catatonit

catatonit hangs due to signal coalescing

Closed · 1 comment

Steps to reproduce:

$ CATATONIT_DEBUG=1 catatonit -- bash -c "bash -c 'sleep 0.01 & kill -9 \$BASHPID'; sleep 0.0087"
DEBUG (catatonit:24487): pid1 (24488) spawned: bash
bash: line 1: 24489 Killed                  bash -c 'sleep 0.01 & kill -9 $BASHPID'
DEBUG (catatonit:24487): child process 24488 exited with code 0
DEBUG (catatonit:24487): child process 24491 exited with code 0
DEBUG (catatonit:24487): got ECHILD: no children left to monitor
(hangs forever)

You may have to fiddle with the final sleep value to reproduce. Binary search for a number such that you see one "child process exited" message half the time and two half the time, and then run it repeatedly until it freezes. (The goal is for the inner sleep to end just before the outer one. I'm sure you could write a fully reliable reproduction using a freezer cgroup or something like that.)

When multiple children of catatonit terminate at nearly the same time, signal coalescing means that only one SIGCHLD is guaranteed to be delivered, and it might not be the one corresponding to the "pid1" process. catatonit correctly runs waitpid in a loop and reaps all of the terminated processes, but it only notices pid1 termination when that SIGCHLD happens to be the one that is delivered.
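The coalescing itself can be observed in isolation. The following is a minimal sketch, not code from catatonit: the names `on_sigchld` and `count_sigchld_for_two_exits` are made up for illustration. It blocks SIGCHLD, lets two children exit, and then counts how many times the handler runs once the signal is unblocked. Because standard signals are not queued, on Linux this should report a single delivery for two exits.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Number of times the SIGCHLD handler has run. */
static volatile sig_atomic_t sigchld_seen = 0;

static void on_sigchld(int signo)
{
    (void)signo;
    sigchld_seen++;
}

/*
 * Hypothetical demo: returns how many SIGCHLD deliveries were observed
 * for two child exits, or -1 on error.
 */
static int count_sigchld_for_two_exits(void)
{
    struct sigaction sa = { .sa_handler = on_sigchld };
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) < 0)
        return -1;

    /* Hold SIGCHLD pending while both children terminate. */
    sigset_t chld;
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &chld, NULL) < 0)
        return -1;

    pid_t kids[2];
    for (int i = 0; i < 2; i++) {
        kids[i] = fork();
        if (kids[i] < 0)
            return -1;
        if (kids[i] == 0)
            _exit(0);
    }

    /* WNOWAIT: block until each child has exited, but leave it as a zombie. */
    for (int i = 0; i < 2; i++) {
        siginfo_t info;
        if (waitid(P_PID, (id_t)kids[i], &info, WEXITED | WNOWAIT) < 0)
            return -1;
    }

    /* Unblocking delivers the pending SIGCHLD before sigprocmask() returns. */
    sigchld_seen = 0;
    if (sigprocmask(SIG_UNBLOCK, &chld, NULL) < 0)
        return -1;
    int seen = sigchld_seen;

    /* Clean up both zombies. */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
    return seen;
}
```

Blocking the signal only makes the race deterministic here. In the reproduction above the same thing happens by timing alone, when the second child exits before the first SIGCHLD has been handled.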

I think the "special case for pid1" needs to be moved into the loop in reap_zombies().
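As a rough sketch of what that change could look like (this is not catatonit's actual code; `reap_all_children` and `demo_coalesced_exit` are hypothetical names, and the real `reap_zombies()` may have a different signature), the pid1 check is performed on every iteration of the `waitpid` loop rather than only for the child that triggered the signal:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical sketch: reap every terminated child and record whether pid1
 * was among them. Returns the number of children reaped, or -1 on an
 * unexpected waitpid() error.
 */
static int reap_all_children(pid_t pid1, bool *pid1_exited, int *pid1_status)
{
    int reaped = 0;

    assert(pid1 > 0);
    for (;;) {
        int status = 0;
        pid_t child = waitpid(-1, &status, WNOHANG);

        if (child == 0)
            break;              /* children remain, but none has exited yet */
        if (child < 0) {
            if (errno == EINTR)
                continue;
            if (errno == ECHILD)
                break;          /* no children left at all */
            return -1;
        }

        reaped++;
        /* Checked for every reaped child: pid1 need not be the first one. */
        if (child == pid1) {
            *pid1_exited = true;
            *pid1_status = status;
        }
    }
    return reaped;
}

/*
 * Hypothetical demo: two children exit before a single reap pass.
 * Returns pid1's exit code if the pass noticed it, otherwise -1.
 */
static int demo_coalesced_exit(void)
{
    pid_t pid1 = fork();
    if (pid1 < 0)
        return -1;
    if (pid1 == 0)
        _exit(7);

    pid_t other = fork();
    if (other < 0)
        return -1;
    if (other == 0)
        _exit(0);

    /* WNOWAIT: wait until both have exited without reaping them. */
    siginfo_t info;
    if (waitid(P_PID, (id_t)pid1, &info, WEXITED | WNOWAIT) < 0 ||
        waitid(P_PID, (id_t)other, &info, WEXITED | WNOWAIT) < 0)
        return -1;

    bool pid1_exited = false;
    int pid1_status = 0;
    if (reap_all_children(pid1, &pid1_exited, &pid1_status) != 2)
        return -1;
    if (!pid1_exited || !WIFEXITED(pid1_status))
        return -1;
    return WEXITSTATUS(pid1_status);
}
```

Calling `demo_coalesced_exit()` from a `main` function exercises the case from this report: both children are already zombies when the single reap pass runs, and pid1's status is still captured regardless of the order in which `waitpid` returns them.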

Yeah, that seems like a reasonable fix. Implemented in e4c14f5, which appears to fix the issue based on my testing. Feel free to ping me if the fix turns out to be insufficient.