jmettraux/rufus-scheduler

Potential bug in handling threads on failure

cromega opened this issue · 6 comments

Hello,

I'm trying to write a distributed lock using Redis so we can run multiple instances of Rufus. I have a working implementation, however, during testing out what happens when Redis is unreachable, I bumped into some weird errors that may or may not be caused by my naive code.

Ruby: 2.7.5p203
Rufus: 3.8.1

The most basic way to reproduce what I'm talking about is this:

require 'rufus-scheduler'
require 'redis'

class RedisLock
  def initialize(redis_url)
    @redis = Redis.new(redis_url: redis_url)
    @id = `hostname`.strip
  end

  def lock
    have_lock? || new_lock
  end

  def unlock
    @redis.del('rufus-lock-test')
  end

  def confirm_lock
    have_lock?
  end

  private

  def have_lock?
    @redis.get('rufus-lock-test') == @id
  end

  def new_lock
    !!@redis.set('rufus-lock-test', @id, nx: true, ex: 5)
  end
end

redis_url = 'redis://localhost:6379' # Redis is NOT running here
rufus = Rufus::Scheduler.new(trigger_lock: RedisLock.new(redis_url))
rufus.cron '* * * * *' do
  puts "Job is running"
end

Signal.trap('INT') { rufus.shutdown(:wait) }
Signal.trap('TERM') { rufus.shutdown(:wait) }

rufus.join

Running rufus, making sure redis is unreachable on redis_url, I get a huuuge stack trace when the job triggers with this on the bottom:

	3: from config/rufustest.rb:42:in `<main>'
	2: from /home/cromega/.rbenv/versions/2.7.5/lib/ruby/gems/2.7.0/gems/rufus-scheduler-3.8.1/lib/rufus/scheduler.rb:148:in `join'
	1: from /home/cromega/.rbenv/versions/2.7.5/lib/ruby/gems/2.7.0/gems/rufus-scheduler-3.8.1/lib/rufus/scheduler.rb:599:in `no_time_limit_join'
/home/cromega/.rbenv/versions/2.7.5/lib/ruby/gems/2.7.0/gems/rufus-scheduler-3.8.1/lib/rufus/scheduler.rb:599:in `pop':
No live threads left. Deadlock? (fatal)
1 threads, 1 sleeps current:0x000055ad0bae4c00 main thread:0x000055ad0b099ff0
* #<Thread:0x000055ad0b0c6d78 sleep_forever>
   rb_thread_t:0x000055ad0b099ff0 native:0x00007fcfa8fceec0 int:0
   /home/cromega/.rbenv/versions/2.7.5/lib/ruby/gems/2.7.0/gems/rufus-scheduler-3.8.1/lib/rufus/scheduler.rb:599:in `pop'
   /home/cromega/.rbenv/versions/2.7.5/lib/ruby/gems/2.7.0/gems/rufus-scheduler-3.8.1/lib/rufus/scheduler.rb:599:in `no_time_limit_join'
   /home/cromega/.rbenv/versions/2.7.5/lib/ruby/gems/2.7.0/gems/rufus-scheduler-3.8.1/lib/rufus/scheduler.rb:148:in `join'
   config/rufustest.rb:42:in `<main>'

Also, I wasn't sure what's the best way to gracefully handle errors in RedisLock. Ultimately, I somehow have to let Rufus know that something fatal is going on: I either let exceptions bubble up into Rufus or I trigger shutdown from the lock. Either way I get weird errors and no matter what I try, I always get a deadlock.

It's sort of tangential, calling shutdown from Signal.trap seems to be a problematic way of going about it as lots of things fail and may even raise errors when called from within a Signal.trap context. Redis calls my blow up with ThreadError, logging may fail, etc. However, I don't really know of a better way to trigger a shutdown on a signal.

Hello,

thanks for reporting that.

I've made a change to the extended error handling done for gh-335.

You can check it by adapting your Gemfile:

source 'https://rubygems.org'

gem 'redis'

#gem 'rufus-scheduler'
#gem 'rufus-scheduler', path: '../../w/rufus-scheduler'
gem 'rufus-scheduler', git: 'https://github.com/jmettraux/rufus-scheduler'

Please tell me if that suits you, and I'll release. Best regards.

@jmettraux thanks for the quick response!

A few things:

First off, the deadlock is gone when a Redis op fails.
However, Rufus keeps running even though the lock raises an exception. I expected Rufus to die with the "error did not occur in a job" message you have just added.

Also, I accidentally left a typo in the on_error callback, an exception raised there causes the same deadlock as pasted above,

However, Rufus keeps running even though the lock raises an exception. I expected Rufus to die with the "error did not occur in a job" message you have just added.

I would prefer you, the final user, to kill the scheduler / exit the Ruby process in that case.

Also, I accidentally left a typo in the on_error callback, an exception raised there causes the same deadlock as pasted above,

I could answer that it's your responsibility to provide rufus-scheduler with a tested and correct on_error custom callback, but granted, a bit of handholding would save some of the final user time. Let me think about it.

@jmettraux making progress here.

I decided to trigger shutdown from within the lock when Redis is unreachable (the inspiration was the ZookeptScheduler in the readme).

It works fine for the most part, except when there is an on_error callback that handles all errors (not just errors that come with a job object) then I see this before Rufus quits:

F, [2022-07-11T11:13:00.223421 #14654] FATAL -- : Failed to send GET command to Redis: Error connecting to Redis on localhost:6379 (Errno::ECONNREFUSED). Rufus is quitting # this is logged in the lock
E, [2022-07-11T11:13:00.223956 #14654] ERROR -- : Target thread must not be current thread (ThreadError) # this is logged in on_error
/home/cromega/.rbenv/versions/2.7.5/lib/ruby/gems/2.7.0/bundler/gems/rufus-scheduler-e7dc394914ac/lib/rufus/scheduler.rb:537:in `join'
/home/cromega/.rbenv/versions/2.7.5/lib/ruby/gems/2.7.0/bundler/gems/rufus-scheduler-e7dc394914ac/lib/rufus/scheduler.rb:537:in `shutdown'
[...]

Hello,

that is good. I've added a check to prevent the join error you reported. Thanks for that.

I plan to release 3.8.2 tomorrow morning. I'll close this issue at that point if there is nothing else.

Best regards!

3.8.2 released. Thanks again.