Ability to kill the process after a timeout if it fails to stop gracefully
turip opened this issue · 7 comments
One of tilt's extensions, docker_build_with_restart, which is used for tilt's hot reload implementation, uses entr to detect changes and restart the services.
The issue that I am struggling with in our system is that some services, under some rare conditions, fail to shut down gracefully (e.g. on SIGTERM). This can be viewed as a bug in the service itself that should be fixed; however, in a development environment, I would want the devenv to be able to kill the service so that the hot reload functionality doesn't seem broken.
I have seen that different signals will be implemented in #121; however, for this development-environment-specific use case it would be useful if entr were configurable so that:
- entr sends SIGTERM (or whatever signal is specified)
- If the child process fails to terminate within the defined number of seconds (e.g. we don't get the SIGCHLD),
- then a SIGKILL is sent to the child process
I think the -t argument is still free, and it should be 0 by default, disabling this approach.
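For illustration, the requested behavior can be sketched in plain shell. This is only a rough model under stated assumptions, not how entr implements or would implement it: stop_with_timeout is a hypothetical helper name, and the process must be a child of the calling shell so that wait can reap it.

```sh
#!/bin/sh
# Hypothetical helper, not entr's implementation.
# usage: stop_with_timeout PID SECONDS
# PID must be a child of the calling shell so that `wait` can reap it.
stop_with_timeout() {
    pid=$1
    grace=$2

    # Ask politely first; if the process is already gone there is nothing to do.
    kill -TERM "$pid" 2>/dev/null || return 0

    # Watchdog: once the grace period has passed, send SIGKILL,
    # which the process cannot catch or ignore.
    ( sleep "$grace"; kill -KILL "$pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog=$!

    # Block until the child exits, whichever signal ended it.
    status=0
    wait "$pid" || status=$?

    # Cancel the watchdog if it is still pending.
    kill "$watchdog" 2>/dev/null || true
    wait "$watchdog" 2>/dev/null || true
    return "$status"
}
```

Called as stop_with_timeout "$pid" 5, it returns the child's wait status, which in shells such as dash and bash is 143 when SIGTERM was enough and 137 when SIGKILL was needed.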
A timeout to send SIGKILL certainly could be implemented, but I'm not sure this will solve the problem. Have you been able to capture the process state of services that failed to terminate? Try this command to get the process state and process group:
ps -x -o stat,tpgid,pid,pgrp,cmd
Unfortunately, I haven't captured the process state; however, this is a Go application. In this setup a single pid is visible, and some background goroutines related to cleanup are hanging.
The process doesn't fork any child processes, so I doubt that any funky process states would be present.
This means that:
- SIGTERM is caught by the single running process
- The process starts the shutdown procedure, but some internal goroutine hangs, so it never completes
If you want I can reproduce this in a synthetic environment, but what would be visible is just the single pid running there endlessly.
I created an experimental branch that adds -t to set a timeout (in seconds).
@turip, let me know if this change seems to solve the problem.
To build:
git fetch origin restart-timeout
git checkout restart-timeout
./configure
make
Thank you, this works great. 💯
I have used this simple app to emulate the hanging cleanup handlers: https://gist.github.com/turip/c2969e4a5e78cc4cc58f731c3fee531f
# without -t the test go app hangs:
root@e0009a12c0a2:/test# touch "/.restart-proc" && echo "/.restart-proc" | ./entr/entr -rz ./testhanging
2023-08-18 07:25:21.904459865 +0000 UTC m=+0.000222000 application starting
2023-08-18 07:25:21.905361362 +0000 UTC m=+0.001123456 awaiting signal
# In a separate shell: date > /.restart-proc
2023-08-18 07:25:27.660574057 +0000 UTC m=+5.756336234 recevied terminated
# with -t it receives the kill signal
root@e0009a12c0a2:/test# touch "/.restart-proc" && echo "/.restart-proc" | ./entr/entr -rz -t 5 ./testhanging
2023-08-18 07:25:07.293147418 +0000 UTC m=+0.000212959 application starting
2023-08-18 07:25:07.293347959 +0000 UTC m=+0.000413500 awaiting signal
# In a separate shell: date > /.restart-proc
2023-08-18 07:25:09.273900701 +0000 UTC m=+1.980966284 recevied terminated
# application killed as expected
2023-08-18 07:25:14.276496949 +0000 UTC m=+0.000162834 application starting
2023-08-18 07:25:14.277028281 +0000 UTC m=+0.000694124 awaiting signal
Can you please tag a new release with this new feature if you have the time?
This test program was actually very helpful!
I was trying to write a system test in shell:
#!/bin/sh
trap 'echo "caught signal"; exit' TERM
echo "running"; sleep 60
entr is able to restart this even though it catches SIGTERM, because entr does not simply signal the child process: it creates a process group and signals the process group with killpg(3).
Since your test program is a single binary, it is able to catch the signal and block until SIGKILL is sent.
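A shell test can also reproduce the hanging case if its trap handler never returns instead of calling exit. The sketch below is a hypothetical stand-in for the Go test program, not something from entr's test suite; hang_on_term is an invented name, and the messages only loosely mirror the Go program's output.

```sh
#!/bin/sh
# Hypothetical stand-in for a service whose cleanup never finishes.
hang_on_term() {
    # The handler reports the signal and then loops forever, so SIGTERM
    # alone never ends this process; only SIGKILL can.
    trap 'echo "received terminated"; while :; do sleep 1; done' TERM
    echo "application starting"
    echo "awaiting signal"
    # Main loop: wait in short sleeps so a pending trap runs promptly.
    while :; do sleep 1; done
}
```

Appending a call to hang_on_term to the file and running it under entr -r should, in principle, leave the shell looping inside the handler after SIGTERM: signalling the process group ends the in-flight sleep, but the shell itself has caught the signal and keeps running.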
I won't be cutting a new release with this feature very soon. I want to think more about how it should integrate with #121, since they are related.
As I noted in #121, the timeout(1) utility already seems to have the features we need:
entr -r timeout -k 4 0 ./catch
Except -k doesn't seem to work: instead of waiting for the specified period of time, it sends SIGKILL immediately.
-k time, --kill-after=time
Send a second signal, SIGKILL, if the command is still running
time after the first signal was sent.