oban-bg/oban

Single DynamicCron with missing worker will crash GenServer and hang all DynamicCrons

Closed this issue · 1 comments

Environment

  • Oban Version - We saw the problem in 1.4.7 and 1.4.8. We were previously using 1.3.5 with no issues.
  • PostgreSQL Version
  • Elixir & Erlang/OTP Versions (elixir --version)

Erlang/OTP 26 [erts-14.2.1] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [jit]
Elixir 1.16.0 (compiled with Erlang/OTP 24)

Current Behavior

With Oban Pro 1.3.5: We had a DynamicCron using a worker module. At some point the worker module was deleted as we did not need it anymore. Nobody thought of deleting the DynamicCron job that used that worker, however everything kept working fine.

When we bumped Oban Pro to 1.4.7, all our DynamicCrons stopped running. Our logs show the following error, once a minute:

GenServer {Oban.Registry, {Oban, {:plugin, Oban.Pro.Plugins.DynamicCron}}} terminating
** (MatchError) no match of right hand side value: {:error, %RuntimeError{message: "unknown worker: MyDeletedWorker"}}
    (oban_pro 1.4.8) lib/oban/pro/plugins/dynamic_cron.ex:643: anonymous fn/1 in Oban.Pro.Plugins.DynamicCron.reload_entries/1
    (elixir 1.16.0) lib/enum.ex:1700: Enum."-map/2-lists^map/1-1-"/2
    (elixir 1.16.0) lib/enum.ex:1700: Enum."-map/2-lists^map/1-1-"/2
    (oban_pro 1.4.8) lib/oban/pro/plugins/dynamic_cron.ex:624: anonymous fn/2 in Oban.Pro.Plugins.DynamicCron.reload_and_insert/1

It looks like the Supervisor kept trying to restart the plugin, but every time it crashes on startup due to the missing worker module.

Deleting the Cron from the database solved the issue.

Expected Behavior

A single bad job should not break the entire cron system.
The old behaviour from 1.3.5 was better; even better would be to log a warning and display in the Oban dashboard something about that broken Cron.

It would also be useful to have a guide on how to deal with changes to workers, maybe even a migration helper to remove them from the cron table.

It would also be useful to have a guide on how to deal with changes to workers, maybe even a migration helper to remove them from the cron table.

A guide is a gread idea, because it can be tricky to handle cleanly.

A migration to remove them from the cron table probably wouldn't work though. Not all nodes have a worker available, and migrations are typically ran before the rest of the application has booted, so there's no reliable way to tell what's missing.

As of v1.4 you can use sync_mode: :automatic to get automatic crontab syncing.