Potential bug in handling DB errors in `Oban.Peers.Postgres`

Question

Closed this issue 3 months ago · 2 comments

vtm9 commented 5 months ago

Precheck

Hello, thank you for such a nice library!

Environment

Oban Version 2.15.3
PostgreSQL 14.10
Elixir 1.16.0 (compiled with Erlang/OTP 25)

Current Behavior

I have encountered this error multiple times. After a database outage, once the database came back up, Oban did not start again.

** (Sentry.CrashError ** (exit) exited in: GenServer.call(#PID<0.31587.50>, :leader?, 5000)
    ** (EXIT) an exception was raised:
        ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 1204ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

  1. Ensuring your database is available and that you can connect to it
  2. Tracking down slow queries and making sure they are running fast enough
  3. Increasing the pool_size (although this increases resource consumption)
  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval

See DBConnection.start_link/2 for more information

            (db_connection 2.6.0) lib/db_connection.ex:1059: DBConnection.transaction/3
            (oban 2.15.1) lib/oban/peers/postgres.ex:94: anonymous fn/2 in Oban.Peers.Postgres.handle_info/2
            (telemetry 1.2.1) /opt/app/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
            (oban 2.15.1) lib/oban/peers/postgres.ex:92: Oban.Peers.Postgres.handle_info/2
            (stdlib 4.3.1.3) gen_server.erl:1123: :gen_server.try_dispatch/4
            (stdlib 4.3.1.3) gen_server.erl:865: :gen_server.loop/7
            (stdlib 4.3.1.3) proc_lib.erl:240: :proc_lib.init_p_do_apply/3)
    (elixir 1.16.0) lib/gen_server.ex:1114: GenServer.call/3
    (oban 2.15.1) lib/oban/peer.ex:99: Oban.Peer.leader?/2
    (oban 2.15.1) lib/oban/stager.ex:101: Oban.Stager.check_leadership_and_stage/1
    (oban 2.15.1) lib/oban/stager.ex:75: anonymous fn/2 in Oban.Stager.handle_info/2
    (telemetry 1.2.1) /opt/app/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    (oban 2.15.1) lib/oban/stager.ex:74: Oban.Stager.handle_info/2

Expected Behavior

Perhaps the peer should also handle DBConnection.ConnectionError, because the rescue clause currently handles only Postgrex.Error:

  rescue
    error in [Postgrex.Error] ->
      if error.postgres.code == :undefined_table do
        Logger.warning("""
        The `oban_peers` table is undefined and leadership is disabled.

        Run migrations up to v11 to restore peer leadership. In the meantime, distributed plugins
        (e.g. Cron, Pruner, Stager) will not run on any nodes.
        """)
      end

      {:noreply, schedule_election(%{state | leader?: false})}
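To illustrate the suggestion, here is a minimal sketch of rescuing more than one exception type in a single clause. It runs on its own and does not use Oban: `Demo.ConnectionError`, `Demo.QueryError`, and `Demo.Election` are hypothetical stand-ins for `DBConnection.ConnectionError`, `Postgrex.Error`, and the peer's election logic, so this is not the library's actual code.

```elixir
defmodule Demo.ConnectionError do
  # Hypothetical stand-in for DBConnection.ConnectionError.
  defexception message: "connection not available"
end

defmodule Demo.QueryError do
  # Hypothetical stand-in for Postgrex.Error.
  defexception message: "query failed"
end

defmodule Demo.Election do
  # Runs `fun` (the election attempt) and stores its result.
  # If either listed exception is raised, the node steps down
  # instead of crashing. Any other exception still propagates.
  def attempt(state, fun) when is_function(fun, 0) do
    %{state | leader?: fun.()}
  rescue
    _error in [Demo.ConnectionError, Demo.QueryError] ->
      %{state | leader?: false}
  end
end
```

With this shape, `Demo.Election.attempt(%{leader?: true}, fn -> raise Demo.ConnectionError end)` returns a state with `leader?: false`. Whether stepping down on a connection error is safe for leadership is a separate question.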

Answer 1 · 2024-02-05T16:10:56.000Z

Also, could there be a catch-all rescue that logs any error and then reraises it?

rescue
  e ->
    Logger.warning("Unexpected error in Oban.Peer: #{inspect(e)}")
    reraise e, __STACKTRACE__

Answer 2 · 2024-02-06T02:49:59.000Z

The Postgres peer only catches the undefined table error currently to provide an upgrade note. Catching all errors could mask actual connectivity issues and trigger a situation with multiple leaders.

Rescuing more errors or catching exits requires some thought.
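For reference, "catching exits" on the caller's side could look roughly like the sketch below. It is self-contained and makes several assumptions: `FlakyPeer` and `SafeLeadership` are hypothetical names, not Oban modules, and the failure is simulated with a `RuntimeError` rather than a real database error. Treating a crashed peer as "not the leader" is only one possible policy, and it carries the trade-offs described above.

```elixir
defmodule FlakyPeer do
  # Hypothetical stand-in for a peer process; not Oban's actual module.
  use GenServer

  # Started without a link so that a crash does not take the caller down.
  def start(opts \\ []), do: GenServer.start(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, %{fail?: Keyword.get(opts, :fail?, false)}}

  @impl true
  def handle_call(:leader?, _from, %{fail?: true}) do
    # Simulates the database being unreachable while the call is served.
    raise RuntimeError, "connection not available"
  end

  def handle_call(:leader?, _from, state), do: {:reply, true, state}
end

defmodule SafeLeadership do
  # Asks the peer whether this node leads. If the peer crashes or the
  # call times out, GenServer.call/3 exits; that exit is caught here
  # and reported as "not the leader".
  def leader?(pid, timeout \\ 5_000) do
    GenServer.call(pid, :leader?, timeout)
  catch
    :exit, _reason -> false
  end
end
```

With a healthy peer, `SafeLeadership.leader?/2` returns the peer's reply. With a peer started as `FlakyPeer.start(fail?: true)`, the server crashes and logs an error, and the caller receives `false` instead of exiting.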