Placating Murderous Unicorn Worker Assassins
The Unicorn project is pretty adamant about not expanding features around timing out slow request workers. I think a little more could be done, but it's their project and they have a point: timeouts managed from outside the thing taking too long are fraught with peril, particularly in Ruby. The general recommendation is to time out the Right Way by using the timeout features built in to components known to take a long time.
What if you don't know what's taking a long time because the main unicorn process comes along and murders your poor unicorn workers just trying to get their job done? Combing through large code-bases to search for where someone might have forgotten to set a timeout on a long-running call is a lot of work with no guarantee that the stopwatch-wielding unicorn murderer will be appeased.
Yes, you should not rely on unicorn's timeout to save your processes, but you can teach your unicorn workers union members to attempt to write death notes before their demise so that you can get a clue as to where in your application there is some procedure feeling unrestrained by time.
Running this Repro
You'll need:
- Tilt
- Docker
- curl
- a local Ruby environment
Steps:
bundle install
- To see OpenTelemetry tracing at Honeycomb, set an environment variable HONEYCOMB_API_KEY with a key before running the next command.
If you don't have or want Honeycomb, don't.
Trace data will still appear, but not as pretty, in the console output of the
otel collector
service in Tilt. tilt up
- Ruby's tilt might have taken precidence on the PATH, so you may need to
/usr/local/bin/tilt up
or/opt/homebrew/bin/tilt up
.
- Ruby's tilt might have taken precidence on the PATH, so you may need to
The Steps This Project Took
- Add rack-timeout to your project despite how it might hose your server. We'll deal with the hosing.
- Set rack-timeout to what you think is a reasonable wait period for a request (see
Tiltfile
and theRACK_TIMEOUT_*
environment variables set for the app).- At the end of this time, rack-timeout's middleware injects an exception into the current thread. Exceptions injected by outside forces risks your worker processes entering a corrupt state. These workers should get retired and replaced. Keep going for how to do that more humanely.
- Set unicorn's worker timeout higher than that. (see
config/unicorn.rb
) - Set rack-timeout's
term_on_timeout
option to a number of times a worker process may timeout before the process is sent a SIGTERM. Unicorn traps TERM, the worker is not ended gracefully.- Set to 1 and rack-timeout sends TERM right before injecting the thread exception, but the worker process will not have enough time to ship any telemetry about the exception out to you.
- Set to higher than 1 and rack-timeout will still inject the thread exception to end the request early, probably returning a 500 response, but TERM won't be sent until the worker process times out again, up to the number set for this option. With TERM delayed to a future timeout, at least telemetry about the previous allowed exceptions ought to have enough time to transmit. You'll miss telemetry about the last one, though.
- Optional, but this increases the gracefulness of a worker's demise:
- Add an
at_exit
block to your application responsible for shipping whatever telemetry you have; in this app, that's OpenTelemetry traces (seeconfig/initializers/opentelemetry.rb
). - Patch rack-timeout to send SIGQUIT instead of SIGTERM. Unicorn traps QUIT and allows the worker to exit gracefully. Your patch should probably live in a fork you maintain, even though fork-maintenance is painful. You should not patch it like I did in this problem-reproduction project.
- With rack-timeout sending SIGQUIT, rack-timeout's
term_on_timeout
can be set to 1, theat_exit
block will take care of shipping telemetry about the long-running code path, and the worker will be replaced by a fresh process with a clean initialized state.
- Add an