[BUG] RESQUE_PRE_SHUTDOWN_TIMEOUT not working at all on Heroku
thromera opened this issue · 20 comments
Hi,
I'm coming back again about the `RESQUE_PRE_SHUTDOWN_TIMEOUT` env variable: after running some tests on a staging environment, it is not working at all.
Here is the configuration I've made:
```
# Procfile
one_worker_and_scheduler: WORKER_COUNT=1 bin/start-stunnel bin/worker_and_scheduler
```
```bash
#!/bin/bash
# bin/worker_and_scheduler
(bundle exec rake resque:scheduler) & (env COUNT=$WORKER_COUNT QUEUE=* bundle exec rake resque:workers)
```
```
# Heroku environment variables:
RESQUE_PRE_SHUTDOWN_TIMEOUT=6
RESQUE_TERM_TIMEOUT=6
TERM_CHILD=1
```
```ruby
# jobs/test_job.rb
class TestJob
  @queue = :system_events

  def self.perform
    count = 1
    loop do
      puts "TESTJOB: #{count}"
      sleep(1)
      count += 1
    end
  end
end
```
```ruby
# In the Rails console on staging:
Resque.enqueue TestJob
#=> true
```
In another terminal tab, the logs for this specific dyno show:
```
# heroku logs --remote staging -t --dyno one_worker_and_scheduler
2017-04-12T14:55:40.973899+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 1
2017-04-12T14:55:41.974026+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 2
2017-04-12T14:55:42.974199+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 3
2017-04-12T14:55:43.974346+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 4
2017-04-12T14:55:44.974519+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 5
2017-04-12T14:55:45.974694+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 6
2017-04-12T14:55:46.974857+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 7
2017-04-12T14:55:47.975032+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 8
2017-04-12T14:55:48.975194+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 9
```
Then I do this in another terminal tab:
```
heroku restart --remote staging
```
And in my worker logs, I can see this:
```
2017-04-12T14:55:49.975723+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 10
2017-04-12T14:55:50.724713+00:00 heroku[one_worker_and_scheduler.1]: Restarting
2017-04-12T14:55:50.725252+00:00 heroku[one_worker_and_scheduler.1]: State changed from up to starting
2017-04-12T14:55:50.975897+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 11
2017-04-12T14:55:51.282979+00:00 heroku[one_worker_and_scheduler.1]: Stopping all processes with SIGTERM
2017-04-12T14:55:51.290502+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=app-end
2017-04-12T14:55:51.290636+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=exit process=app
2017-04-12T14:55:51.290678+00:00 app[one_worker_and_scheduler.1]: rake aborted!
2017-04-12T14:55:51.290772+00:00 app[one_worker_and_scheduler.1]: SignalException: SIGTERM
2017-04-12T14:55:51.291114+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/resque-1.27.3/lib/resque/tasks.rb:37:in `join'
2017-04-12T14:55:51.291117+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/resque-1.27.3/lib/resque/tasks.rb:37:in `block (3 levels) in <top (required)>'
2017-04-12T14:55:51.291118+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/resque-1.27.3/lib/resque/tasks.rb:37:in `each'
2017-04-12T14:55:51.291118+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/resque-1.27.3/lib/resque/tasks.rb:37:in `block (2 levels) in <top (required)>'
2017-04-12T14:55:51.291119+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli/exec.rb:74:in `load'
2017-04-12T14:55:51.291120+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli/exec.rb:74:in `kernel_load'
2017-04-12T14:55:51.291120+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli/exec.rb:27:in `run'
2017-04-12T14:55:51.291121+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli.rb:332:in `exec'
2017-04-12T14:55:51.291122+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
2017-04-12T14:55:51.291122+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/vendor/thor/lib/thor/invocation.rb:126:in `invoke_command'
2017-04-12T14:55:51.291123+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/vendor/thor/lib/thor.rb:359:in `dispatch'
2017-04-12T14:55:51.291123+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli.rb:20:in `dispatch'
2017-04-12T14:55:51.291124+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/vendor/thor/lib/thor/base.rb:440:in `start'
2017-04-12T14:55:51.291125+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli.rb:11:in `start'
2017-04-12T14:55:51.291125+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/exe/bundle:34:in `block in <top (required)>'
2017-04-12T14:55:51.291126+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/friendly_errors.rb:100:in `with_friendly_errors'
2017-04-12T14:55:51.291126+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/exe/bundle:26:in `<top (required)>'
2017-04-12T14:55:51.291127+00:00 app[one_worker_and_scheduler.1]: /app/bin/bundle:3:in `load'
2017-04-12T14:55:51.291127+00:00 app[one_worker_and_scheduler.1]: /app/bin/bundle:3:in `<main>'
2017-04-12T14:55:51.291131+00:00 app[one_worker_and_scheduler.1]: Tasks: TOP => resque:workers
2017-04-12T14:55:51.291172+00:00 app[one_worker_and_scheduler.1]: (See full trace by running task with --trace)
2017-04-12T14:55:51.291776+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=kill-app pid=20
2017-04-12T14:55:51.292117+00:00 app[one_worker_and_scheduler.1]: bin/start-stunnel: line 49: kill: (-20) - No such process
2017-04-12T14:55:51.292195+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=wait-app pid=20
2017-04-12T14:55:51.292314+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=kill-aux name=stunnel pid=19 signal=SIGINT
2017-04-12T14:55:51.292859+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=stunnel-end
2017-04-12T14:55:51.381566+00:00 heroku[one_worker_and_scheduler.1]: Process exited with status 0
2017-04-12T14:56:00.901580+00:00 heroku[one_worker_and_scheduler.1]: Starting process with command `WORKER_COUNT=1 bin/start-stunnel bin/worker_and_schedule
```
As you can see, there is no delay between
```
2017-04-12T14:55:50.725252+00:00 heroku[one_worker_and_scheduler.1]: State changed from up to starting
2017-04-12T14:55:50.975897+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 11
2017-04-12T14:55:51.282979+00:00 heroku[one_worker_and_scheduler.1]: Stopping all processes with SIGTERM
```
and the job actually being killed completely.
When I read this: https://github.com/resque/resque#signals-on-heroku
it seems that the delay should always be < 10 seconds (the time after which Heroku sends SIGKILL).
Am I doing something wrong, or does it just not work?
Happy to give more information, as it's quite critical for a bunch of people out there I believe.
Thank you!
@Erowlin I think I figured out the problem here:

- On my Heroku config, `term_signal && fork_per_job? == true`.
- Because of this state, `unregister_signal_handlers` is run on the child (i.e. forked) process. This modifies the `trap` logic to throw `TermException` immediately when it receives a `TERM` signal.
- If a `TERM` signal is ONLY ever sent to the parent process, this works great. The parent catches the signal, waits the right amount of time, and then passes the signal down to the worker.
- However, it looks like Heroku iterates over ALL running processes and sends `TERM` to each of them immediately. This is definitely the case on cedar and heroku-16. Here's the test application: https://gist.github.com/iloveitaly/7d16735ad395ea1399e8d115b4295aa2
- Since the signal is sent to the child process immediately, the `TERM` handler in `unregister_signal_handlers` runs, triggering a `TermException` which skips all of the logic in `new_kill_child`.
This monkeypatch looks to solve the issue for me:
```ruby
Resque::Worker.class_eval do
  def unregister_signal_handlers
    trap('TERM') do
      trap('TERM') do
        puts "[resque-override] received second term signal from parent, throwing termexception"
        trap('TERM') do
          puts "[resque-override] ignoring > 2 term signal"
        end
        raise Resque::TermException.new("SIGTERM")
      end
      puts "[resque-override] received first term signal from heroku, ignoring"
    end

    trap('INT', 'DEFAULT')

    begin
      trap('QUIT', 'DEFAULT')
      trap('USR1', 'DEFAULT')
      trap('USR2', 'DEFAULT')
    rescue ArgumentError
    end
  end
end
```
I'm not sure how to make this into a PR without implementing a heroku-specific flag. Most other non-heroku platforms/devops architectures only signal the parent process.
I've asked heroku support if there's a way for them to respect the process hierarchy and only signal the parent process. I'll let you know if I find anything.
Thanks for the info, @iloveitaly. Hopefully heroku can resolve this on their end.
@chrisccerami My hunch is they don't have any options to change how their infrastructure handles this. I'll let you know when I hear back.
Any idea how to best handle this case? I'm thinking a heroku flag would be the only way to cleanly handle this, right?
Thanks for your answer @iloveitaly. It makes sense actually. However, I'm not sure we can do much on our side. Should we try to sync with the Heroku team and see what our options are?
@chrisccerami @Erowlin Just heard back from heroku support: there's no way to change how their platform signals processes. What do you guys think about some sort of feature flag to handle this case? Signals don't have any metadata so I don't think there's a way to know where the signal is coming from.
Also, I've updated the above snippet with some improved log statements and a namespace bug fix.
I'm not sure I get your point about the feature flag. Could you provide an example @iloveitaly ?
I believe he means passing in a flag if you're using heroku, so that Resque can handle shutdown the way heroku expects, without interfering with other platforms.
@chrisccerami that's exactly right: it looks like we need a heroku-specific codepath to adjust how signals are handled when the app is hosted on Heroku.
Ah got it, it makes sense actually. It also means that we have to implement specific logic within the Resque code to properly handle the Heroku behaviour, or does it already exist in the Resque codebase?
-> Update: I believe we have to use your monkey patch until it's merged?
> Update: I believe we have to use your monkey patch until it's merged?
That's right!
@chrisccerami I know there aren't platform-specific feature flags right now in the resque codebase. Any opinions on the best way to implement this?
I'll have to put some thought into this (perhaps next week when I have a week off), but I'm open to ideas about how we can implement this in a way that doesn't make the code too messy for the generic case.
Sounds good! I hope you don't spend time on this on your week off though!
What about implementing the heroku-specific code in a plugin such as `resque-heroku`? That would allow us to not mess with the generic case in the main code base.
Just a random thought.
> What about implementing the heroku specific code in a plugin such as resque-heroku, that would allow us to not mess with the generic case in the main code base.
I like this a lot! The downside would be having to monkeypatch core Resque. However, if the `resque-heroku` gem locks to an x.x.x Resque version, that would eliminate the risk of patching the wrong code, or of the underlying logic changing across releases without us knowing.
Pulling out platform-specific flags into a separate gem does seem to make the most sense to me. I can go ahead and do this unless anyone else has other thoughts!
I'm happy with that solution.
Hey All! I've created a gem to wrap this patch:
I'm curious if anyone here has experienced this same issue with Amazon ECS. I'm having a hard time rescuing `Resque::TermException` to retry the job, so I'm assuming the child process on ECS is also being directly signalled. From the ECS documentation, killing a task is basically like running `docker stop`, which, per its documentation, sends SIGTERM to the root process, which should be the parent worker. It would be helpful if anyone knows what the situation with Amazon ECS is!
I'm currently debugging the same issue. So far it seems like the `TermException`, which should get raised in the trap context of the forked child, is not passed up to the parent.
It might be related that we also fork another process in our child, but if I remember correctly, back in the day it was working.
I was able to reproduce it by starting a Resque worker inside a Docker container and then sending a TERM signal to the container:
```
docker kill --signal=SIGTERM <container_id>
```
I'd expect a `TermException` to be raised, but it is not. When I try to raise it via my own trap handler:
```ruby
trap('TERM') do
  Resque.logger.info { "RECEIVED TERMSIGNAL" }
  raise TermException.new("SIGTERM")
  Resque.logger.info { "RECEIVED TERMSIGNAL after" }
end
```
I get the first log entry, but the exception is never raised, and of course the second log entry never appears either.
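For what it's worth, the raise-from-trap mechanism itself does work in a plain single-process Ruby script, which suggests the extra fork in the child is a factor here. A minimal sketch (no Resque involved; `DemoTermError` is a hypothetical stand-in for `Resque::TermException`):

```ruby
# Minimal single-process sketch: a `raise` inside a trap handler aborts the
# handler at that line, and the exception surfaces in the main thread,
# where it can be rescued.
class DemoTermError < StandardError; end # hypothetical stand-in for Resque::TermException

events = []

trap('TERM') do
  events << :handler_entered
  raise DemoTermError, "SIGTERM"
  events << :after_raise # never reached: the raise aborts the handler
end

begin
  Process.kill('TERM', Process.pid) # signal ourselves
  sleep 0.2                         # give the handler a chance to run
rescue DemoTermError
  events << :rescued
end

p events # => [:handler_entered, :rescued]
```

So in the simple case the raise does propagate out of the trap context and is rescuable; the question is why that stops working once the worker forks again.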
On a superficial read, doesn't the undocumented `graceful_term` flag (e5a2ee7) make Resque compatible with Heroku without the need for `resque-heroku-signals`? /cc @iloveitaly
I mean, provided `GRACEFUL_TERM` is set and `TERM_CHILD` unset, you'd get `@shutdown` set, the worker would wait for the job to finish, and the job could handle `SIGTERM`. Everything is killed past 30 seconds. This would be the same shutdown experience you'd get in an arbitrary dyno.
(You would lose Resque's child termination logic.)
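Concretely, the configuration described above would be something like the following (a sketch, assuming the flag is still read from `ENV['GRACEFUL_TERM']`):

```
# Enable graceful term handling and make sure TERM_CHILD is unset:
heroku config:set GRACEFUL_TERM=1 --remote staging
heroku config:unset TERM_CHILD --remote staging
```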
I believe the solution implemented by https://github.com/resque/resque-heroku-signals is fundamentally flawed, because it ignores the documented caveat that processes can get `SIGTERM` multiple times:

> Please note that it is currently possible that processes in a dyno that is being shut down may receive multiple SIGTERMs

The assumption in that gem is that the second `SIGTERM` comes from the worker, but this is not guaranteed. I have seen it happen in tests with Resque workers (you get `Resque::TermException` before `RESQUE_PRE_SHUTDOWN_TIMEOUT` seconds have elapsed), and this is a minimal Ruby script where you can see it:
I am thinking about alternative approaches.
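One possible shape for an alternative (my own sketch, not code from `resque-heroku-signals`): record when the first `TERM` arrived and only treat a later `TERM` as the real shutdown signal once the pre-shutdown window has elapsed, ignoring duplicates inside the window regardless of which process sent them. The `TermTracker` class and the wiring below are hypothetical:

```ruby
# Hypothetical helper: classifies incoming TERM signals by arrival time
# instead of assuming the second TERM always comes from the parent worker.
class TermTracker
  def initialize(timeout)
    @timeout = timeout    # seconds during which duplicate TERMs are ignored
    @first_term_at = nil
  end

  # Returns :wait for the first TERM, :ignore for duplicates inside the
  # timeout window, and :shutdown once the window has elapsed.
  def record(now = Process.clock_gettime(Process::CLOCK_MONOTONIC))
    if @first_term_at.nil?
      @first_term_at = now
      :wait
    elsif now - @first_term_at >= @timeout
      :shutdown
    else
      :ignore
    end
  end
end

# Hypothetical wiring inside a worker's signal setup:
tracker = TermTracker.new(Float(ENV.fetch('RESQUE_PRE_SHUTDOWN_TIMEOUT', 5)))
if defined?(Resque)
  trap('TERM') do
    raise Resque::TermException.new("SIGTERM") if tracker.record == :shutdown
  end
end
```

This sidesteps the "who sent the second TERM" question entirely, at the cost of delaying the actual shutdown exception until the window has passed.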