sorentwo/oban

Potential bug in handling DB errors in `Oban.Peers.Postgres`

Closed this issue · 2 comments

vtm9 commented

Precheck

Hello, thank you for such a nice library!

Environment

  • Oban Version 2.15.3
  • PostgreSQL 14.10
  • Elixir 1.16.0 (compiled with Erlang/OTP 25)

Current Behavior

I have hit this error several times. After a period of database downtime, Oban did not start again once the database came back up.

** (Sentry.CrashError ** (exit) exited in: GenServer.call(#PID<0.31587.50>, :leader?, 5000)
    ** (EXIT) an exception was raised:
        ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 1204ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

  1. Ensuring your database is available and that you can connect to it
  2. Tracking down slow queries and making sure they are running fast enough
  3. Increasing the pool_size (although this increases resource consumption)
  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval

See DBConnection.start_link/2 for more information

            (db_connection 2.6.0) lib/db_connection.ex:1059: DBConnection.transaction/3
            (oban 2.15.1) lib/oban/peers/postgres.ex:94: anonymous fn/2 in Oban.Peers.Postgres.handle_info/2
            (telemetry 1.2.1) /opt/app/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
            (oban 2.15.1) lib/oban/peers/postgres.ex:92: Oban.Peers.Postgres.handle_info/2
            (stdlib 4.3.1.3) gen_server.erl:1123: :gen_server.try_dispatch/4
            (stdlib 4.3.1.3) gen_server.erl:865: :gen_server.loop/7
            (stdlib 4.3.1.3) proc_lib.erl:240: :proc_lib.init_p_do_apply/3)
    (elixir 1.16.0) lib/gen_server.ex:1114: GenServer.call/3
    (oban 2.15.1) lib/oban/peer.ex:99: Oban.Peer.leader?/2
    (oban 2.15.1) lib/oban/stager.ex:101: Oban.Stager.check_leadership_and_stage/1
    (oban 2.15.1) lib/oban/stager.ex:75: anonymous fn/2 in Oban.Stager.handle_info/2
    (telemetry 1.2.1) /opt/app/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    (oban 2.15.1) lib/oban/stager.ex:74: Oban.Stager.handle_info/2

Expected Behavior

The peer should probably handle `DBConnection.ConnectionError` as well. Right now the `rescue` clause handles only `Postgrex.Error`:

  rescue
    error in [Postgrex.Error] ->
      if error.postgres.code == :undefined_table do
        Logger.warning("""
        The `oban_peers` table is undefined and leadership is disabled.

        Run migrations up to v11 to restore peer leadership. In the meantime, distributed plugins
        (e.g. Cron, Pruner, Stager) will not run on any nodes.
        """)
      end

      {:noreply, schedule_election(%{state | leader?: false})}

vtm9 commented

Alternatively, could there be a catch-all rescue that logs any error and then re-raises it?

rescue
  e ->
    Logger.warning("Unexpected error in the Oban.Peer: #{inspect(e)}")
    reraise e, __STACKTRACE__
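
To make the suggestion concrete, here is a minimal, self-contained sketch of a peer-like GenServer that rescues a connection error narrowly, logs it, gives up leadership, and schedules another election instead of crashing. This is not Oban's code: `PeerSketch`, `FakeConnectionError` (a stand-in for `DBConnection.ConnectionError` so the script runs without dependencies), `elect_fun`, and the 50 ms interval are all hypothetical.

```elixir
defmodule FakeConnectionError do
  # Hypothetical stand-in for DBConnection.ConnectionError.
  defexception message: "connection not available"
end

defmodule PeerSketch do
  use GenServer
  require Logger

  # `elect_fun` is a hypothetical zero-arity function standing in for the
  # election query; it returns true or false, or raises on a DB outage.
  def start_link(elect_fun), do: GenServer.start_link(__MODULE__, elect_fun)

  def leader?(pid), do: GenServer.call(pid, :leader?)

  @impl true
  def init(elect_fun) do
    # Trigger the first election as soon as the process is running.
    send(self(), :election)
    {:ok, %{elect_fun: elect_fun, leader?: false}}
  end

  @impl true
  def handle_call(:leader?, _from, state), do: {:reply, state.leader?, state}

  @impl true
  def handle_info(:election, state) do
    {:noreply, schedule_election(%{state | leader?: state.elect_fun.()})}
  rescue
    # Only the connection error is rescued; anything else still crashes.
    error in [FakeConnectionError] ->
      Logger.warning("Election failed: " <> Exception.message(error))
      {:noreply, schedule_election(%{state | leader?: false})}
  end

  defp schedule_election(state) do
    Process.send_after(self(), :election, 50)
    state
  end
end

# The election always fails, yet the process stays alive and reports false.
{:ok, pid} = PeerSketch.start_link(fn -> raise FakeConnectionError end)
IO.inspect(PeerSketch.leader?(pid))
# => false
```

Dropping leadership on failure is the conservative choice in this sketch; whether that is safe for a real cluster depends on how the other nodes observe the outage.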

The Postgres peer currently rescues only the undefined table error, and only to provide an upgrade note. Catching all errors could mask real connectivity issues and lead to a situation with multiple leaders.

Rescuing more errors or catching exits requires some thought.
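
For reference, the "catching exits" option would sit on the caller side. The stack trace above shows the exit surfacing in `GenServer.call/3` because the peer process died while the call was pending. The sketch below reproduces that mechanism with hypothetical modules (`CrashyPeer`, `SafeCaller`) and falls back to `false` when the call exits. It illustrates the mechanism only, not a proposed patch, and it has the drawback already noted: the caller no longer sees why the peer is unavailable.

```elixir
defmodule CrashyPeer do
  use GenServer

  # Started unlinked so the crash below does not take the caller down with it.
  def start, do: GenServer.start(__MODULE__, nil)

  @impl true
  def init(nil), do: {:ok, nil}

  @impl true
  def handle_call(:leader?, _from, _state) do
    # Simulates the peer raising while a caller is waiting on it.
    raise("connection not available")
  end
end

defmodule SafeCaller do
  require Logger

  # Hypothetical wrapper: treat any exit from the call as "not the leader".
  def leader?(pid, timeout \\ 5_000) do
    GenServer.call(pid, :leader?, timeout)
  catch
    :exit, reason ->
      Logger.warning("Peer unavailable: " <> inspect(reason))
      false
  end
end

{:ok, pid} = CrashyPeer.start()
IO.inspect(SafeCaller.leader?(pid))
# => false
```

A later call against the now-dead pid exits as well, so it also returns `false` rather than crashing the caller.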