Scheduling thread can die silently

Question

Scheduling thread can die silently

blowfishpro opened this issue 2 years ago · 7 comments

Steps to reproduce:

Install rufus-scheduler version 3.8.1 (the latest version at present)

Run the following script:

require 'rufus-scheduler'

scheduler = Rufus::Scheduler.new
scheduler.every '10s', blocking: true do
  puts 'job triggered'
  t = Thread.current
  Thread.new do
    sleep 1
    t.raise('example exception')
  end
end
scheduler.join

Observed behavior:

The job is triggered only once, then the script hangs forever but the job isn't run again

Desired behavior:

The aborted scheduler thread causes the script to exit
Maybe join raises an exception if the scheduling thread died abnormally?

Workarounds:

Ensure that "example exception" is not raised on the calling thread
Don't use blocking jobs (in the code above only the worker thread will die and the cron will keep firing and spawning new worker threads)

Notes, explanation, and context:
The code that just triggers an exception in the originating thread seems weird but it's a simulation of how the bunny gem (RabbitMQ client) handles some network errors - a new bunny connection creates some listener threads which by default raise an exception in the originating thread when they encounter a network error). This behavior is configurable, but it also seems like the rufus code is a bit brittle in this regard.

In particular, this code - maybe rejoin should be in an ensure to make sure it always gets called at the end of the thread? It would also be nice in this case to know from join that the scheduler thread raised an exception.

As a side note, in our current architecture we run rufus as a dedicated process whose only purpose is to schedule jobs. It seems like the dedicated thread for scheduling jobs is unnecessary in this case? But rufus doesn't provide an interface for scheduling synchronously on the calling thread.

These changes seem relatively straightforward, if there's buy-in I'm happy to submit a pull request with them.

Answer 1 · 2022-05-10T01:50:02.000Z

Hello Talia,

thanks for your issue report.

Have you had a look at #on_error, you might exit your script from there.

As a side note, in our current architecture we run rufus as a dedicated process whose only purpose is to schedule jobs. It seems like the dedicated thread for scheduling jobs is unnecessary in this case?

If I were in your shoes, I'd probably reduce the script to something like:

require 'fugit'

#CRON = Fugit.parse('every 10s')
CRON = Fugit.parse('*/2 * * * *')

FREQUENCY = [ CRON.rough_frequency / 2, 60 ].min

nt = CRON.next_time

loop do
  n = Time.now
  if n > nt
    p [ n, :doing_the_job ]
    nt = CRON.next_time
  else
    puts '.'
  end
  sleep FREQUENCY
end

Answer 2 · 2022-05-10T02:38:44.000Z

on_error seems useful in general but it won't work in this case - the exception is oftenraised while the scheduler thread is sleeping, not while it is currently invoking a job.

I provided a simplified example for reproduction but in reality our scheduler process is handling potentially dozens of cron schedules at once, which is why we're using rufus in the first place.

Answer 3 · 2022-05-10T02:49:54.000Z

What version of Ruby are you using?

on ruby 2.6.6p146 (2020-03-31 revision 67876) [x86_64-openbsd6.7]

the script does exit with return code 1 and outputs:

job triggered
#<Thread:0x00000ce4ae2ee4c0@/home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:630 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
        1: from /home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:638:in `block in start'
/home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:638:in `sleep': example exception (RuntimeError)
Traceback (most recent call last):
        1: from /home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:638:in `block in start'
/home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:638:in `sleep': example exception (RuntimeError)
        3: from thalia.rb:13:in `<main>'
        2: from /home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:148:in `join'
        1: from /home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:599:in `no_time_limit_join'
/home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:599:in `pop': No live threads left. Deadlock? (fatal)
1 threads, 1 sleeps current:0x00000ce48b8c1c00 main thread:0x00000ce4b47d4800
* #<Thread:0x00000ce4a77cc638 sleep_forever>
   rb_thread_t:0x00000ce4b47d4800 native:0x00000ce51369f878 int:0
   /home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:599:in `pop'
   /home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:599:in `no_time_limit_join'
   /home/jmettraux/w/rufus-scheduler/lib/rufus/scheduler.rb:148:in `join'
   thalia.rb:13:in `<main>'

Answer 4 · 2022-05-10T04:18:29.000Z

Ah no worries, I see the symptoms you describe on ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-openbsd6.8].

Answer 5 · 2022-05-10T05:03:44.000Z

Please tell me if the above change suits your need.

It seems ok when seen from:

require 'rufus-scheduler'

scheduler = Rufus::Scheduler.new
scheduler.every '10s', blocking: true do
  puts 'job triggered'
  t = Thread.current
  Thread.new do
    sleep 1
    t.raise('example exception')
  end
end
def scheduler.on_error(x, y)
  p [ '>>>>', x, y ]
  exit 1
end
scheduler.join

Answer 6 · 2022-05-10T22:41:47.000Z

The ruby version thing is an interesting find. I think in practice we have more threads running so that exception is never hit regardless though (ActiveRecord seems to spawn a thread, Bunny spawns a couple). I added the following to the top of the script and it's now consistent: Thread.new { sleep }

The change in 3fe5a02 looks reasonable. Having the whole thing crash in this case would make sense too but I'm not too particular between the two.

Answer 7 · 2022-07-07T21:46:11.000Z

Closing. Work moved to gh-337.