resque/resque

[BUG] RESQUE_PRE_SHUTDOWN_TIMEOUT not working at all on Heroku

thromera opened this issue · 20 comments

Hi,

I'm coming back again about the RESQUE_PRE_SHUTDOWN_TIMEOUT env variable: after running some tests on a staging environment, it's not working at all.

Here is the configuration I've made:

#Procfile
one_worker_and_scheduler: WORKER_COUNT=1 bin/start-stunnel bin/worker_and_scheduler
#bin/worker_and_scheduler
#!/bin/bash
(bundle exec rake resque:scheduler) & (env COUNT=$WORKER_COUNT QUEUE=* bundle exec rake resque:workers)
# Heroku environment variables: 
RESQUE_PRE_SHUTDOWN_TIMEOUT=6
RESQUE_TERM_TIMEOUT=6
TERM_CHILD=1
#jobs/test_job.rb
class TestJob
  @queue = :system_events

  def self.perform
    count = 1
    loop do
      puts "TESTJOB: #{count}"
      sleep(1)
      count += 1
    end
  end
end
# In rails console on staging: 
Resque.enqueue TestJob
#=> true 
#another terminal tab, logs on this specific dyno are showing:
# heroku logs --remote staging -t --dyno one_worker_and_scheduler
2017-04-12T14:55:40.973899+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 1
2017-04-12T14:55:41.974026+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 2
2017-04-12T14:55:42.974199+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 3
2017-04-12T14:55:43.974346+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 4
2017-04-12T14:55:44.974519+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 5
2017-04-12T14:55:45.974694+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 6
2017-04-12T14:55:46.974857+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 7
2017-04-12T14:55:47.975032+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 8
2017-04-12T14:55:48.975194+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 9

Then I do this in another terminal tab:
heroku restart --remote staging

And in my worker logs, I can see this:

2017-04-12T14:55:49.975723+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 10
2017-04-12T14:55:50.724713+00:00 heroku[one_worker_and_scheduler.1]: Restarting
2017-04-12T14:55:50.725252+00:00 heroku[one_worker_and_scheduler.1]: State changed from up to starting
2017-04-12T14:55:50.975897+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 11
2017-04-12T14:55:51.282979+00:00 heroku[one_worker_and_scheduler.1]: Stopping all processes with SIGTERM
2017-04-12T14:55:51.290502+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=app-end
2017-04-12T14:55:51.290636+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=exit process=app
2017-04-12T14:55:51.290678+00:00 app[one_worker_and_scheduler.1]: rake aborted!
2017-04-12T14:55:51.290772+00:00 app[one_worker_and_scheduler.1]: SignalException: SIGTERM
2017-04-12T14:55:51.291114+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/resque-1.27.3/lib/resque/tasks.rb:37:in `join'
2017-04-12T14:55:51.291117+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/resque-1.27.3/lib/resque/tasks.rb:37:in `block (3 levels) in <top (required)>'
2017-04-12T14:55:51.291118+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/resque-1.27.3/lib/resque/tasks.rb:37:in `each'
2017-04-12T14:55:51.291118+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/resque-1.27.3/lib/resque/tasks.rb:37:in `block (2 levels) in <top (required)>'
2017-04-12T14:55:51.291119+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli/exec.rb:74:in `load'
2017-04-12T14:55:51.291120+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli/exec.rb:74:in `kernel_load'
2017-04-12T14:55:51.291120+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli/exec.rb:27:in `run'
2017-04-12T14:55:51.291121+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli.rb:332:in `exec'
2017-04-12T14:55:51.291122+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
2017-04-12T14:55:51.291122+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/vendor/thor/lib/thor/invocation.rb:126:in `invoke_command'
2017-04-12T14:55:51.291123+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/vendor/thor/lib/thor.rb:359:in `dispatch'
2017-04-12T14:55:51.291123+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli.rb:20:in `dispatch'
2017-04-12T14:55:51.291124+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/vendor/thor/lib/thor/base.rb:440:in `start'
2017-04-12T14:55:51.291125+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/cli.rb:11:in `start'
2017-04-12T14:55:51.291125+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/exe/bundle:34:in `block in <top (required)>'
2017-04-12T14:55:51.291126+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/lib/bundler/friendly_errors.rb:100:in `with_friendly_errors'
2017-04-12T14:55:51.291126+00:00 app[one_worker_and_scheduler.1]: /app/vendor/bundle/ruby/2.3.0/gems/bundler-1.13.7/exe/bundle:26:in `<top (required)>'
2017-04-12T14:55:51.291127+00:00 app[one_worker_and_scheduler.1]: /app/bin/bundle:3:in `load'
2017-04-12T14:55:51.291127+00:00 app[one_worker_and_scheduler.1]: /app/bin/bundle:3:in `<main>'
2017-04-12T14:55:51.291131+00:00 app[one_worker_and_scheduler.1]: Tasks: TOP => resque:workers
2017-04-12T14:55:51.291172+00:00 app[one_worker_and_scheduler.1]: (See full trace by running task with --trace)
2017-04-12T14:55:51.291776+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=kill-app pid=20
2017-04-12T14:55:51.292117+00:00 app[one_worker_and_scheduler.1]: bin/start-stunnel: line 49: kill: (-20) - No such process
2017-04-12T14:55:51.292195+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=wait-app pid=20
2017-04-12T14:55:51.292314+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=kill-aux name=stunnel pid=19 signal=SIGINT
2017-04-12T14:55:51.292859+00:00 app[one_worker_and_scheduler.1]: buildpack=stunnel at=stunnel-end
2017-04-12T14:55:51.381566+00:00 heroku[one_worker_and_scheduler.1]: Process exited with status 0
2017-04-12T14:56:00.901580+00:00 heroku[one_worker_and_scheduler.1]: Starting process with command `WORKER_COUNT=1 bin/start-stunnel bin/worker_and_schedule

As you can see, there is no delay between

2017-04-12T14:55:50.725252+00:00 heroku[one_worker_and_scheduler.1]: State changed from up to starting
2017-04-12T14:55:50.975897+00:00 app[one_worker_and_scheduler.1]: TESTJOB: 11
2017-04-12T14:55:51.282979+00:00 heroku[one_worker_and_scheduler.1]: Stopping all processes with SIGTERM

and the job actually being killed completely.

When I read this: https://github.com/resque/resque#signals-on-heroku
it seems that the delay should always be < 10 seconds (the time after which Heroku sends a SIGKILL).
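
For reference, here is my understanding of how the two timeouts are supposed to combine (the defaults in the comments are my assumption):

pre_shutdown = ENV.fetch("RESQUE_PRE_SHUTDOWN_TIMEOUT", "0").to_f # parent waits this long before sending TERM to the child
term_timeout = ENV.fetch("RESQUE_TERM_TIMEOUT", "4").to_f         # child then gets this long before the parent sends KILL
puts "worst-case graceful shutdown: #{pre_shutdown + term_timeout}s, which has to fit inside Heroku's SIGTERM -> SIGKILL window"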

Am I doing something wrong, or does it simply not work?
Happy to give more information; I believe this is quite critical for a lot of people out there.

Thank you!

@Erowlin I think I figured out the problem here:

  • On my Heroku config, term_signal && fork_per_job? == true
  • Because of this, unregister_signal_handlers is run in the child (i.e. forked) process. This changes the trap logic to raise TermException immediately when the child receives a TERM signal.
  • If a TERM signal is ONLY ever sent to the parent process, this works great: the parent catches the signal, waits the right amount of time, and then passes the signal down to the forked child.
  • However, it looks like Heroku iterates over ALL running processes and sends TERM to each of them immediately. This is definitely the case on cedar and heroku-16. Here's the test application: https://gist.github.com/iloveitaly/7d16735ad395ea1399e8d115b4295aa2 (see the sketch after this list for the same idea).
  • Since the signal is sent to the child process immediately, the TERM handler installed by unregister_signal_handlers runs right away, raising a TermException and skipping all of the logic in new_kill_child.
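
Here's a minimal sketch of the same experiment (not the linked gist, just the idea): fork a child, have both processes log any TERM they receive, and watch whether the child's line shows up before the parent forwards anything.

child = fork do
  trap("TERM") { puts "child  #{Process.pid} got TERM at #{Time.now}"; exit! }
  sleep 300
end

trap("TERM") do
  puts "parent #{Process.pid} got TERM at #{Time.now}"
  begin
    Process.kill("TERM", child) # forward to the child, like the resque parent would
  rescue Errno::ESRCH
  end
end

Process.wait(child)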

This monkeypatch seems to solve the issue for me:

# In the forked child: ignore the first TERM (Heroku sends it directly to every
# process in the dyno) and only raise TermException on the second TERM, which the
# parent worker forwards after RESQUE_PRE_SHUTDOWN_TIMEOUT.
Resque::Worker.class_eval do
  def unregister_signal_handlers
    trap('TERM') do
      trap('TERM') do
        puts "[resque-override] received second term signal from parent, throwing termexception"

        # ignore any further TERMs so cleanup isn't interrupted
        trap('TERM') do
          puts "[resque-override] ignoring > 2 term signal"
        end

        raise Resque::TermException.new("SIGTERM")
      end

      puts "[resque-override] received first term signal from heroku, ignoring"
    end

    trap('INT', 'DEFAULT')

    begin
      trap('QUIT', 'DEFAULT')
      trap('USR1', 'DEFAULT')
      trap('USR2', 'DEFAULT')
    rescue ArgumentError
    end
  end
end

I'm not sure how to make this into a PR without implementing a heroku-specific flag. Most other non-heroku platforms/devops architectures only signal the parent process.

I've asked heroku support if there's a way for them to respect the process hierarchy and only signal the parent process. I'll let you know if I find anything.

Thanks for the info, @iloveitaly. Hopefully heroku can resolve this on their end.

@chrisccerami My hunch is they don't have any options to change how their infrastructure handles this. I'll let you know when I hear back.

Any idea how to best handle this case? I'm thinking a heroku flag would be the only way to cleanly handle this, right?

Thanks for your answer @iloveitaly. It makes sense actually. However, I'm not sure we can do much on our side. Should we try to sync with the Heroku team and see what our options are?

@chrisccerami @Erowlin Just heard back from heroku support: there's no way to change how their platform signals processes. What do you guys think about some sort of feature flag to handle this case? Signals don't have any metadata so I don't think there's a way to know where the signal is coming from.

Also, I've updated the snippet above with improved log statements and a namespace bug fix.

I'm not sure I get your point about the feature flag. Could you provide an example, @iloveitaly?

I believe he means passing in a flag if you're using heroku, so that Resque can handle shutdown the way heroku expects, without interfering with other platforms.

@chrisccerami that's exactly right—it looks like we need a heroku-specific codepath to adjust how signals are handled when the app is hosted on heroku.
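
To make it concrete, this is roughly what I have in mind. Purely a sketch: none of these names exist in resque today, and RESQUE_HEROKU_SIGNALS is a made-up opt-in flag.

require "resque"

Resque::Worker.class_eval do
  def unregister_signal_handlers
    if ENV["RESQUE_HEROKU_SIGNALS"] # hypothetical opt-in flag
      trap("TERM") do
        # First TERM comes straight from the platform and is ignored; only the second,
        # forwarded by the parent after RESQUE_PRE_SHUTDOWN_TIMEOUT, aborts the job.
        trap("TERM") { raise Resque::TermException.new("SIGTERM") }
      end
    else
      trap("TERM") { raise Resque::TermException.new("SIGTERM") } # roughly the stock behaviour
    end

    trap("INT", "DEFAULT")
    begin
      trap("QUIT", "DEFAULT")
      trap("USR1", "DEFAULT")
      trap("USR2", "DEFAULT")
    rescue ArgumentError
    end
  end
end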

Ah, got it, that makes sense. It also means we have to implement specific logic within the Resque code to properly handle the Heroku behaviour, or does that already exist in the Resque codebase?

-> Update I believe we have to use your monkey patch until it's merged?

> -> Update I believe we have to use your monkey patch until it's merged?

That's right!

@chrisccerami I know there aren't platform-specific feature flags right now in the resque codebase. Any opinions on the best way to implement this?

I'll have to put some thought into this (perhaps next week when I have a week off), but I'm open to ideas about how we can implement this in a way that doesn't make the code too messy for the generic case.

Sounds good! I hope you don't spend time on this on your week off though!

What about implementing the Heroku-specific code in a plugin such as resque-heroku? That would allow us to not mess with the generic case in the main code base.

Just a random thought.

> What about implementing the Heroku-specific code in a plugin such as resque-heroku? That would allow us to not mess with the generic case in the main code base.

I like this a lot! The downside would be having to monkeypatch core Resque. However, if the resque-heroku gem locks to an exact x.x.x resque version, that would eliminate the risk of patching the wrong code, or of the underlying logic changing across releases without us knowing.
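
Roughly what I mean by locking, as an illustrative gemspec sketch (the gem name and version pin here are just examples):

Gem::Specification.new do |s|
  s.name    = "resque-heroku"
  s.version = "0.1.0"
  s.summary = "Heroku-specific signal handling for Resque workers"
  s.authors = ["TBD"]
  s.files   = Dir["lib/**/*.rb"]
  s.add_dependency "resque", "= 1.27.3" # exact pin so the patched internals can't change underneath us
end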

Pulling platform-specific handling out into a separate gem does seem to make the most sense to me. I can go ahead and do this unless anyone else has other thoughts!

I'm happy with that solution.

Hey All! I've created a gem to wrap this patch:

https://github.com/iloveitaly/resque-heroku-signals

I'm curious whether anyone here has experienced this same issue with Amazon ECS. I'm having a hard time rescuing Resque::TermException to retry the job, so I'm assuming the child process on ECS is also being signalled directly. According to the ECS documentation, killing a task is basically like running docker stop, which per its documentation sends SIGTERM to the root process, which should be the parent worker. It would be helpful if anyone knows what the situation with Amazon ECS actually is!
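
For context, this is roughly the retry pattern I'm attempting (a sketch, with a made-up job; it only works if the exception actually reaches perform):

require "resque"

class LongRunningJob
  @queue = :system_events

  def self.perform(*args)
    do_work(*args)
  rescue Resque::TermException
    Resque.enqueue(self, *args) # put the job back so it runs again after the task restarts
  end

  def self.do_work(*args)
    # long-running work goes here
  end
end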

I'm currently debugging the same issue. So far it seems like the TermException, which should get raised in the trap context of the forked child, is not passed up to the parent.
It might be related to the fact that we also fork another process in our child, but if I remember correctly, this used to work back in the day.

I was able to reproduce this by starting a resque worker inside a docker container and then sending a TERM signal to the container:
docker kill --signal=SIGTERM <container_id>

I'd expect a TermException to be raised, but it is not. When I try to raise it via my own trap handler:

trap('TERM') do
  Resque.logger.info { "RECEIVED TERMSIGNAL" }
  raise TermException.new("SIGTERM")
  Resque.logger.info { "RECEIVED TERMSIGNAL after" }
end

I get the first log entry, but the exception is never raised and, of course, the second log entry never appears either.

fxn commented

On a superficial read, doesn't the undocumented graceful_term flag (e5a2ee7) make Resque compatible with Heroku without the need for resque-heroku-signals? /cc @iloveitaly

I mean, provided GRACEFUL_TERM is set and TERM_CHILD unset, you'd get @shutdown set, the worker would wait for the job to finish, and the job could handle SIGTERM. Everything killed past 30 seconds. This would be the same shutdown experience you'd get in an arbitrary dyno.

(You would lose Resque's child termination logic.)
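
For illustration, a job under that setup could look something like this (a sketch, assuming the job is free to install its own TERM handler in the forked child):

class GracefulJob
  @queue = :system_events

  def self.perform
    stop = false
    trap("TERM") { stop = true } # the job handles SIGTERM itself

    until stop
      # keep each unit of work small so shutdown stays well inside the 30-second window
      sleep 1
    end
  end
end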

fxn commented

I believe the solution implemented by https://github.com/resque/resque-heroku-signals is fundamentally flawed, because it ignores the documented caveat that processes can get SIGTERM multiple times:

> Please note that it is currently possible that processes in a dyno that is being shut down may receive multiple SIGTERMs

The assumption in that gem is that the second SIGTERM comes from the worker, but this is not guaranteed. I have seen it happening in tests with Resque workers (you get Resque::TermException before RESQUE_PRE_SHUTDOWN_TIMEOUT seconds), and this is a minimal Ruby script where you can see it:

[screenshot: minimal Ruby script demonstrating the multiple-SIGTERM behaviour]
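
(The script was attached as an image; a sketch of the same experiment is simply a process that counts the SIGTERMs it receives during shutdown:)

count = 0
trap("TERM") do
  count += 1
  puts "#{Time.now} received SIGTERM ##{count}"
end

sleep 60 # stay alive long enough to observe the platform's shutdown sequence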

I am thinking about alternative approaches.