openSUSE/catatonit

catatonit hangs when pid1 exits with code 127

Closed this issue · 2 comments

Steps to reproduce:

$ CATATONIT_DEBUG=1 catatonit -- bash -c "exit 127"
DEBUG (catatonit:25051): pid1 (25052) spawned: bash
WARN (catatonit:25051): received SIGCHLD from pid1 (25052) but it's still alive
DEBUG (catatonit:25051): child process 25052 exited with code 127
DEBUG (catatonit:25051): got ECHILD: no children left to monitor

This one is completely reproducible. I found it by making a typo when trying to reproduce the other hanging issue (bash returns 127 for "Command not found"). Exit codes 126 and 128 work fine.

The WARN message makes the control-flow path pretty clear. I am not sure that the use of kill() is legitimate: the man page implies that a zombie process "exists" in the relevant sense for kill(pid, 0) to succeed, so it may be that if neither WIFEXITED nor WIFSIGNALED returns true, it will always print the "it's still alive" message and hang. However, I do not know why the WIFEXITED macro would return false just because the exit code is 127!
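
The point about kill(pid, 0) and zombies can be checked in isolation. The following is a minimal sketch in Python (whose os module wraps the same system calls), not catatonit's code; the function name zombie_answers_kill0 is hypothetical, and it assumes a Unix system where os.waitid and os.WNOWAIT are available (for example Linux).

```python
import os


def zombie_answers_kill0(code=127):
    """Fork a child that exits with `code`, then probe it with kill(pid, 0)
    while it is still an unreaped zombie.

    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running Python cleanup handlers.
        os._exit(code)

    # Parent: block until the child has exited, but leave it unreaped
    # (WNOWAIT), so it remains a zombie.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # Signal 0 performs only the existence and permission checks.
    try:
        os.kill(pid, 0)
        looks_alive = True
    except ProcessLookupError:
        looks_alive = False

    # Now reap the child for real and decode its wait status.
    _, status = os.waitpid(pid, 0)
    return looks_alive, os.WIFEXITED(status), os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(zombie_answers_kill0())
```

If the zombie still counts as existing, the first element of the result is True even though the child has already exited, so a kill(pid, 0) probe alone cannot tell a running process from an unreaped one.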

It's because WIFEXITED (and wait4 in general) bit-packs the type of exit into the status code. I will have to look into this...
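
To make the bit-packing concrete, here is a small sketch in Python, again using the os wrappers around the C macros. The helper names wait_status_for_exit and describe are hypothetical, and nothing here is taken from catatonit's source.

```python
import os


def wait_status_for_exit(code):
    """Fork a child that exits with `code` and return the raw wait status."""
    pid = os.fork()
    if pid == 0:
        os._exit(code)  # child: exit immediately
    _, status = os.waitpid(pid, 0)
    return status


def describe(status):
    """Decode a raw wait status with the standard macros."""
    if os.WIFEXITED(status):
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        return ("signaled", os.WTERMSIG(status))
    if os.WIFSTOPPED(status):
        return ("stopped", os.WSTOPSIG(status))
    return ("unknown", status)


if __name__ == "__main__":
    for code in (126, 127, 128):
        status = wait_status_for_exit(code)
        # The raw status is not the exit code; the macros unpack it.
        print(code, hex(status), describe(status))
```

A status obtained from waitpid decodes as exited for 126, 127 and 128 alike. On Linux with glibc, the low seven bits hold the terminating signal and the value 0x7f in the low byte marks a stopped child, so a bare 127 handed to these macros without going through the packed encoding satisfies neither WIFEXITED nor WIFSIGNALED. Whether that is the path catatonit actually took is an assumption, not something this thread confirms.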

According to my testing, this was fixed by the same patch as #4. Let me know if you can still reproduce it with e4c14f5. I don't like how we handle exit signals (especially if killed by a signal) at the moment, but that's an issue for another day.