Graceful shutdown request

Question

Graceful shutdown request

pjanetzek opened this issue 2 years ago · 5 comments

Hey Kevin I hope you are doing well. I'm currently working on my next tile-server so naturally recalled your project from tangram-paparazzi days for the server and worker implementation :)

It's not a big deal but I wonder if there is a way to shutdown prime_server worker threads gracefully. Even with quiesce and long shutdown wait time there is only a small chance that worker destructors being called. Since the worker holds a postgres connection it would probably be nice to have it sign-out properly.

Looking at production use cases brought up this TODO
https://github.com/valhalla/valhalla/blob/master/src/odin/worker.cc#L131
// TODO: should we listen for SIGINT and terminate gracefully/exit(0)?

Answer 1 · 2023-04-11T20:45:12.000Z

Could it be that worker_t::work() needs to be interrupted by the SIGTERM handler to wake up, check shutdown state, and end its loop? I'm an absolute beginner with zmq so any pointers what to read/look into would be helpful

Answer 2 · 2023-04-11T21:02:53.000Z

hey great to hear from you aga8n!

maybe the best place to start would be an example that shows the problem?

the worker is on the main thread and it monitors a boolean that a detached thread controls. the detached thread is the one with the signal handler. I think the problem is the detached thread calls exit. it's been a while since I worked on this but I could imagine that exiting from the detached thread could make it so that the main thread doesn't destruct the worker in time.

what value are you using for the shutdown time? if its short or if your jobs are long or even if your destructor is long these could all cause a problem.

with a sample program we can illustrate this with well placed sleeps and prints I'm sure. but my question is which bit is the unexpected behavior? the contract is that you set a drain and shutdown time that suits your needs, meaning you have to know about how much time you need and add a bit of a buffer.

Answer 3 · 2023-04-14T09:46:12.000Z

Hey thanks Kevin!
I've resolved the issue for me now. There were a few things coming together: The use of std::bind for setting the thread context didn't work for non-copyable worker contexts. Then I've made the worker thread a member of the worker context, but this didn't get cleaned up because server_thread.join() would never return (pjanetzek@dc94438).

What you think of this change to show how to use per-worker context with threads?
pjanetzek@9bd14c4#diff-e8517519d7779489e858dfec86c5645264f954c8e91b687a25c385ed0c0eda0dR143

Answer 4 · 2023-04-14T09:51:51.000Z

It might be a nice addition to change the example to show how the same code can be used for a monolithic dev server with threads and for separate services managed by systemd e.g. - I would give it a try :)

Answer 5 · 2023-04-14T20:30:28.000Z

ah, that second one is a good catch about the poll timeout. maybe it should be min(request_timeout, shutdown_delay) or maybe half that to make sure its not a near miss?

the first thing you mentioned im not sure i understand what the problem was or why it was happening at first glace. you are saying that after the polling fix, it still didnt call your destructor because we are binding to a temporary instance of the worker?

and yeah, totally, im happy to update the code/examples to make it so that others don't fall into the same traps that you've found!