Graceful shutdown when receiving the TERM signal

Question

Closed this issue a year ago · 7 comments

Kubernetes terminates processes in running containers by sending a TERM signal, and then, after terminationGracePeriodSeconds, it sends a KILL signal to forcefully shut them down.

My Mosquito deployment stays active and keeps processing jobs until it receives that KILL. This is a common problem; I also have to work around it in my web apps by closing the HTTP::Server instance. It looks like we would use Mosquito::Runner.stop here, but we don't want to exit immediately because that can leave jobs in a partially processed state.

Instead, if Mosquito::Runner.stop blocks (optionally?) until all currently executing jobs are complete, that should let them shut down gracefully but quickly.
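For context, the web app workaround mentioned above can be sketched roughly like this. It uses only the Crystal standard library; the handler body, bind address, and port are placeholders, not taken from a real app.

```crystal
require "http/server"

# Placeholder handler; a real app would do its routing here.
server = HTTP::Server.new do |context|
  context.response.content_type = "text/plain"
  context.response.print "ok"
end

# On TERM, stop accepting new connections so `listen` can return
# and the process can exit before Kubernetes sends KILL.
Signal::TERM.trap do
  server.close unless server.closed?
end

server.bind_tcp "0.0.0.0", 8080
server.listen # blocks until the server is closed
```

Sending the process a TERM (for example with `kill -TERM <pid>`) should make `listen` return, after which the program runs off the end and exits.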

Answer 1 · 2023-11-14T20:42:23.000Z

Yes, you're right: Runner#stop will do what you want.

The implementation of this isn't quite universal -- it depends on how you're booting up your worker. The demo script has this code in it:

Signal::INT.trap do
  Mosquito::Runner.stop
end

Mosquito::Runner.start(spin: false)

You could certainly trap Signal::QUIT and TERM instead or in addition. However, Mosquito::Runner.stop is not captive. I don't think that should be a problem -- though it may mean you need to implement some sort of spin lock on your own to wait for shutdown.

Answer 2 · 2023-11-15T18:53:31.000Z

> it may mean you need to implement some sort of spin lock on your own to wait for shutdown.

This is one of the reasons I posted this issue, actually. If I start the runner with runner.start(spin: false) and stop it with runner.stop, how do I know when that runner is done? I don't see a way to inspect its running state.

Answer 3 · 2023-11-18T04:17:37.000Z

Interesting, I see what you mean.

The start(spin: false) interface doesn't really please me, but I don't know what I should replace it with... and I think the lack of a pattern to mimic leaves me without a good vision for what stop should look like. What would you want to do here? The demo script I linked above spins around a check on keep_running, but that is also not well named anymore because the runner has more granularity than simply running and not-running.

Do you have a suggestion of an interface that would accomplish what you're thinking? Can you share what you are currently doing to work-around the lack of functionality?

Answer 4 · 2023-11-18T19:37:16.000Z

I've been trying to come up with something for this for the past couple days. I feel like the default functionality with start is a solid interface (start blocks until it's finished), and maybe both start and stop could block until the runner exits so that regardless of how you structure things it'll just work, but I don't know if the juice is worth the squeeze on that.

I agree that spin: false isn't ideal, though. Since blocking operations in Crystal can be moved to the background with spawn, it may not actually be necessary to solve in Mosquito.
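A minimal sketch of that spawn approach, using a stand-in worker rather than Mosquito itself. ToyWorker and its methods are hypothetical names made up for illustration; the idea is only that a blocking run can be pushed into a fiber and a Channel can report when it has returned.

```crystal
# Stand-in for a blocking worker loop; not part of Mosquito.
class ToyWorker
  getter processed = 0

  def initialize
    @keep_running = true
  end

  # Blocks until `stop` has been called and the current job has finished.
  def run : Nil
    while @keep_running
      sleep 10.milliseconds # pretend to process one job
      @processed += 1
    end
  end

  # Asks the loop to exit after the job it is currently working on.
  def stop : Nil
    @keep_running = false
  end
end

worker = ToyWorker.new
done = Channel(Nil).new

# Move the blocking call to a background fiber and signal when it returns.
spawn do
  worker.run
  done.send(nil)
end

sleep 50.milliseconds # let the worker process a few jobs
worker.stop
done.receive # blocks until the worker loop has exited
puts "worker finished after #{worker.processed} jobs"
```

The caller gets a "wait until done" primitive without the worker itself having to expose one, which is roughly why this may not need solving inside the library.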

Answer 5 · 2023-11-18T21:22:16.000Z

I have a strong preference for a batteries-included type interface with mosquito, even though it's not all there yet.

I like the idea of a blocking stop command, but I'd probably make it optional, as with start. stop(wait: true) would wait for the shutdown to complete before returning, and stop() would return immediately.

Regardless, the runner's notion of state needs to be more robust so it can handle at least: running, shutting down, stopped.
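To make the idea concrete, here is a rough sketch of what an optional blocking stop plus a richer state could look like. ToyRunner, its State enum, and the extra Idle state are all hypothetical and are not Mosquito's implementation; the job loop is faked with a short sleep.

```crystal
# Hypothetical sketch; names and structure are illustrative, not Mosquito's code.
class ToyRunner
  enum State
    Idle
    Running
    Stopping
    Stopped
  end

  getter state : State = State::Idle

  def initialize
    @finished = Channel(Nil).new
  end

  # Starts the work loop in a background fiber and returns immediately.
  def start : Nil
    @state = State::Running
    spawn do
      while @state.running?
        sleep 10.milliseconds # stand-in for executing one job
      end
      @state = State::Stopped
      @finished.close # wakes up any caller blocked in `stop(wait: true)`
    end
  end

  # Requests shutdown; with `wait: true`, blocks until the work loop has exited.
  def stop(wait : Bool = false) : Nil
    return if @state.idle? # never started, so there is nothing to wait for
    @state = State::Stopping if @state.running?
    @finished.receive? if wait
  end
end

runner = ToyRunner.new
runner.start
runner.stop # returns immediately
puts runner.state # prints "Stopping"
runner.stop(wait: true) # blocks until the loop has exited
puts runner.state # prints "Stopped"
```

Closing the channel instead of sending on it means any number of callers can wait on the same shutdown, and calling stop(wait: true) after the runner has already stopped returns right away.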

Answer 6 · 2023-11-18T22:40:28.000Z

> I have a strong preference for a batteries-included type interface with mosquito

💯 For folks who came to Crystal from languages where the convention is to load code at runtime using a CLI provided by a framework, having to write our own entrypoint into the background-job runner which loads all of our jobs is new, so reducing that cognitive load as much as is feasible goes a long way.

Answer 7 · 2023-12-20T00:28:58.000Z

@jgaskins 51904a0 is merged and includes improvements to the Runner interface. You can now call Mosquito::Runner.stop(wait: true) and it will not return until it's finished working.

You can then modify your worker.cr with a signal handler which will respond to SIGINT or whatever is right for your deployment.
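As a rough sketch, a worker.cr for a Kubernetes deployment might look like the following. It assumes Mosquito::Runner.start blocks until the runner has stopped, and it leaves out Mosquito configuration (such as the Redis connection) and the requires for your own job classes; the commented path is a placeholder.

```crystal
require "mosquito"

# Require your job classes here so the runner can find them.
# The path below is a placeholder:
# require "./jobs/*"

# Kubernetes sends TERM first; INT covers Ctrl-C during local development.
[Signal::TERM, Signal::INT].each do |signal|
  signal.trap do
    # Does not return until the runner has finished working.
    Mosquito::Runner.stop(wait: true)
  end
end

# Assumption: this call blocks until the runner has stopped.
Mosquito::Runner.start
```

For this to help, terminationGracePeriodSeconds needs to be longer than your slowest job, otherwise the KILL can still arrive mid-job.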

See the Runner docs for details.