Ability to kill the process after a timeout if it fails to stop gracefully

Question

turip opened this issue a year ago · 7 comments

One of tilt's extensions, docker_build_with_restart, which is used for tilt's hot-reload implementation, uses entr to detect changes and restart the services.

The issue I am struggling with in our system is that some services, under rare conditions, fail to shut down gracefully (e.g. on SIGTERM). This can be viewed as a bug in the service itself that should be fixed. However, in a development environment I would want the devenv to be able to kill the service so that hot reload doesn't appear broken.

I have seen that different signals will be implemented in #121; however, for this development-environment use case it would be useful if entr could be configured so that:

If entr sends SIGTERM (or whichever signal is specified), and
the child process fails to terminate within the configured number of seconds (i.e. entr does not receive SIGCHLD),
then SIGKILL is sent to the child process.

I think the -t argument is still free. It should default to 0, which disables this behavior.
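The requested escalation can be sketched in plain shell. This is only an illustration of the idea, not how entr implements or would implement it: the function name stop_with_timeout and its arguments are hypothetical, and it assumes the process being stopped is a child of the calling shell.

```shell
#!/bin/sh
# Hypothetical helper, not part of entr: ask a child process to stop with
# SIGTERM and escalate to SIGKILL if it is still running after a timeout.
# Usage: stop_with_timeout PID SECONDS
# PID must be a child of the calling shell, because wait is used to detect
# its exit. SECONDS=0 disables the escalation and waits indefinitely.
stop_with_timeout() {
    pid=$1
    timeout=$2
    watchdog=

    # Step 1: send the graceful-shutdown signal.
    kill -TERM "$pid" 2>/dev/null

    # Step 2: arm a watchdog that sends SIGKILL once the timeout expires.
    # Its output is discarded so it never holds the caller's stdout open.
    if [ "$timeout" -gt 0 ]; then
        ( sleep "$timeout"; kill -KILL "$pid" 2>/dev/null ) >/dev/null 2>&1 &
        watchdog=$!
    fi

    # Step 3: block until the child exits, gracefully or by SIGKILL.
    wait "$pid"
    status=$?

    # Cancel the watchdog if the child exited before the timeout.
    # Its sleep may linger until it expires, which is harmless.
    if [ -n "$watchdog" ]; then
        kill "$watchdog" 2>/dev/null
        wait "$watchdog" 2>/dev/null
    fi
    return "$status"
}

# Example: a child that ignores SIGTERM is killed after about 2 seconds.
#   sh -c 'trap "" TERM; while :; do sleep 1; done' &
#   stop_with_timeout "$!" 2
```

With a timeout of 0 the helper waits for as long as the child takes, which mirrors the suggested default of leaving the feature disabled.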

Answer 1 · 2023-08-10T16:57:31.000Z

A timeout to send SIGKILL certainly could be implemented, but I'm not sure this will solve the problem. Have you been able to capture the process state of services that failed to terminate? Try this command to get the process state and process group:

ps -x -o stat,tpgid,pid,pgrp,cmd

Answer 2 · 2023-08-11T12:14:42.000Z

Unfortunately, I haven't captured the process state. However, this is a Go application. In this setup a single PID is visible, and some background goroutines related to cleanup are hanging.

The process doesn't fork any child processes, so I doubt any funky process states would be present.

This means that:

SIGTERM is caught by the single running process
The process starts its shutdown procedure, but an internal goroutine hangs, so shutdown never completes

If you want, I can reproduce this in a synthetic environment, but all that would be visible is the single PID running there endlessly.
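For readers who want to see the failure mode without the real service, here is a rough shell stand-in for a process that catches SIGTERM and then hangs in its cleanup. It is an assumption-laden sketch: it is not the Go application described above, the function name hanging_app and the file name hang.sh are hypothetical, and the messages are placeholders.

```shell
#!/bin/sh
# Hypothetical stand-in for a service whose shutdown handler never returns.
# It is NOT the Go application discussed in this thread.
hanging_app() {
    # On SIGTERM, announce the signal and then block forever, simulating
    # a cleanup routine that hangs.
    trap 'echo "received terminated"; while :; do sleep 1; done' TERM

    echo "application starting"
    echo "awaiting signal"
    while :; do sleep 1; done
}

# Run the stand-in only when the script is invoked as: sh hang.sh run
if [ "${1:-}" = "run" ]; then
    hanging_app
fi
```

After the first SIGTERM the process keeps running until it receives SIGKILL, which is the situation a kill timeout is meant to handle.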

Answer 3 · 2023-08-16T02:09:04.000Z

I created an experimental branch that adds -t to set a timeout (in seconds).

@turip, let me know if this change seems to solve the problem.

To build:

git fetch origin restart-timeout
git checkout restart-timeout
./configure
make

Answer 4 · 2023-08-18T07:29:44.000Z

Thank you, this works great. 💯

I have used this simple app to emulate the hanging cleanup handlers: https://gist.github.com/turip/c2969e4a5e78cc4cc58f731c3fee531f

# without -t the test go app hangs:
root@e0009a12c0a2:/test# touch "/.restart-proc" && echo "/.restart-proc" | ./entr/entr -rz  ./testhanging 
2023-08-18 07:25:21.904459865 +0000 UTC m=+0.000222000 application starting
2023-08-18 07:25:21.905361362 +0000 UTC m=+0.001123456 awaiting signal
#  In a separate shell:  date > /.restart-proc
2023-08-18 07:25:27.660574057 +0000 UTC m=+5.756336234 recevied terminated

# with -t it receives the kill signal
root@e0009a12c0a2:/test# touch "/.restart-proc" && echo "/.restart-proc" | ./entr/entr -rz -t 5 ./testhanging 
2023-08-18 07:25:07.293147418 +0000 UTC m=+0.000212959 application starting
2023-08-18 07:25:07.293347959 +0000 UTC m=+0.000413500 awaiting signal
#  In a separate shell:  date > /.restart-proc
2023-08-18 07:25:09.273900701 +0000 UTC m=+1.980966284 recevied terminated
# application killed as expected
2023-08-18 07:25:14.276496949 +0000 UTC m=+0.000162834 application starting
2023-08-18 07:25:14.277028281 +0000 UTC m=+0.000694124 awaiting signal

Could you please tag a new release with this feature when you have the time?

Answer 5 · 2023-08-18T18:03:11.000Z

This test program was actually very helpful!

I was trying to write a system test in shell:

#!/bin/sh
trap 'echo "caught signal"; exit' TERM
echo "running"; sleep 60

entr is able to restart this even though it catches SIGTERM because entr does not simply signal the child process: it creates a process group and signals the process group with killpg(3).

Since your test program is a single binary, it is able to catch the signal and block until SIGKILL is sent.

I won't be cutting a new release with this feature very soon--I want to think more about how it should integrate with #121 since they are related.

Answer 6 · 2023-09-15T13:04:19.000Z

As I noted in #121, the timeout(1) utility already seems to have the features we need:

entr -r timeout -k 4 0 ./catch

Except that -k doesn't seem to work: instead of waiting for the specified period of time, it sends SIGKILL immediately.

     -k time, --kill-after=time
             Send a second signal, SIGKILL, if the command is still running
             time after the first signal was sent.

Answer 7 · 2023-09-15T14:04:46.000Z

It appears that the timeout command on OpenBSD has a bug, but this works with GNU Coreutils. Closing. @turip, let me know how this works for you.