oban-bg/oban

Postgres serialization failure error when fetching jobs using `serializable` transaction isolation level

Closed this issue · 5 comments

Environment

  • Oban Version: 2.18.0
  • PostgreSQL Version: 15.5
  • Erlang/OTP Version: 27.0.1
  • Elixir Version: 1.17.1-otp-27

Current Behavior

While running Oban in a deployed environment with two nodes (non-distributed Erlang), we encountered the following Postgres `serialization_failure` exception, reported by Sentry:

%Postgrex.Error{
   message: nil,
   postgres: %{
     code: :serialization_failure,
     line: "227",
     message: "could not serialize access due to concurrent update",
     file: "nodeLockRows.c",
     unknown: "ERROR",
     severity: "ERROR",
     pg_code: "40001",
     routine: "ExecLockRows"
   },
   connection_id: 25779,
   query: nil
 }

The stack trace shows the error originates from the Oban queue producer when it calls the `Oban.Engines.Basic.fetch_jobs/3` function.

[
  {Ecto.Adapters.SQL, :raise_sql_call_error, 1,
   [file: ~c"lib/ecto/adapters/sql.ex", line: 1054, error_info: %{module: Exception}]},
  {Ecto.Adapters.SQL, :execute, 6, [file: ~c"lib/ecto/adapters/sql.ex", line: 952]},
  {Ecto.Repo.Queryable, :execute, 4, [file: ~c"lib/ecto/repo/queryable.ex", line: 232]},
  {Oban.Engines.Basic, :"-fetch_jobs/3-fun-0-", 4,
   [file: ~c"lib/oban/engines/basic.ex", line: 119]},
  {Ecto.Adapters.SQL, :"-checkout_or_transaction/4-fun-0-", 3,
   [file: ~c"lib/ecto/adapters/sql.ex", line: 1358]},
  {DBConnection, :run_transaction, 4, [file: ~c"lib/db_connection.ex", line: 1756]},
  {Oban.Engine, :"-fetch_jobs/3-fun-0-", 4, [file: ~c"lib/oban/engine.ex", line: 234]},
  {Oban.Engine, :"-with_span/4-fun-0-", 3, [file: ~c"lib/oban/engine.ex", line: 339]},
  {:telemetry, :span, 3, [file: ~c"/app/deps/telemetry/src/telemetry.erl", line: 321]},
  {Oban.Queue.Producer, :start_jobs, 1, [file: ~c"lib/oban/queue/producer.ex", line: 253]},
  {Oban.Queue.Producer, :"-dispatch/1-fun-0-", 2,
   [file: ~c"lib/oban/queue/producer.ex", line: 244]},
  {:telemetry, :span, 3, [file: ~c"/app/deps/telemetry/src/telemetry.erl", line: 321]},
  {Oban.Queue.Producer, :dispatch, 1, [file: ~c"lib/oban/queue/producer.ex", line: 243]},
  {Oban.Queue.Producer, :handle_info, 2, [file: ~c"lib/oban/queue/producer.ex", line: 170]},
  {:gen_server, :try_handle_info, 3, [file: ~c"gen_server.erl", line: 2173]},
  {:gen_server, :handle_msg, 6, [file: ~c"gen_server.erl", line: 2261]},
  {:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 329]}
]

Postgres Configuration

We have set Postgres' default transaction isolation level to `serializable`, the strictest isolation level:

ALTER DATABASE <database> SET default_transaction_isolation = 'serializable';
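As a sanity check, the level actually in effect for a session can be inspected at runtime. A minimal sketch, assuming an Ecto repo named `Repo`:

```elixir
# Query the isolation level for the current session ("Repo" stands in
# for whatever Ecto repo the application uses).
%Postgrex.Result{rows: [[level]]} = Repo.query!("SHOW transaction_isolation")

# `level` should be "serializable" once the ALTER DATABASE above has
# taken effect for newly established connections.
```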

The Postgres documentation also advises that applications using serializable transactions automatically retry any transaction rolled back with the `:serialization_failure` error we encountered above.

If the Serializable transaction isolation level is used for all writes and for all reads which need a consistent view of the data, no other effort is required to ensure consistency. Software from other environments which is written to use serializable transactions to ensure consistency should “just work” in this regard in PostgreSQL.

When using this technique, it will avoid creating an unnecessary burden for application programmers if the application software goes through a framework which automatically retries transactions which are rolled back with a serialization failure. It may be a good idea to set default_transaction_isolation to serializable. It would also be wise to take some action to ensure that no other transaction isolation level is used, either inadvertently or to subvert integrity checks, through checks of the transaction isolation level in triggers.

See: Enforcing Consistency with Serializable Transactions

Within our application we execute all database updates through an automatic retry mechanism: a helper function that wraps Ecto's `Repo.transaction` with exception handling to detect `:serialization_failure` and retry the transaction:

def transaction_with_retries(fun_or_multi, delay \\ 5, retries \\ 10) do
  Repo.transaction(fun_or_multi)
rescue
  ex in Postgrex.Error ->
    # Limit retries to prevent infinite attempts
    if retries <= 0, do: reraise(ex, __STACKTRACE__)

    case ex do
      %Postgrex.Error{postgres: %{code: :serialization_failure, pg_code: "40001"}} ->
        # Retry transaction after a delay
        Process.sleep(delay)
        
        delay = delay + :rand.uniform(delay)
        retries = retries - 1
        
        transaction_with_retries(fun_or_multi, delay, retries)

      %Postgrex.Error{} ->
        reraise(ex, __STACKTRACE__)
    end
end

Doing this eliminates nearly all of the Postgres serialization failures.
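Call sites then look just like plain `Repo.transaction/1` calls. A hypothetical example (the `User` schema and its fields are illustrative, not from our codebase):

```elixir
# Hypothetical usage: insert a record inside a retried transaction.
# On a :serialization_failure the anonymous function is re-run, so it
# must be safe to execute more than once.
{:ok, user} =
  transaction_with_retries(fn ->
    Repo.insert!(%User{name: "Jane"})
  end)
```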

Expected Behavior

The behaviour we want is for Oban to use a similar retry strategy when executing its internal database queries, such as those in the Oban.Engines.Basic module.

One possible solution would be to implement our own engine module as a direct copy of the existing `Oban.Engines.Basic` module, with the `Oban.Repo.transaction/2` calls replaced by a retry function such as `transaction_with_retries/3` above. The drawback is that the basic engine contains a lot of important logic that would need to be kept in sync with each Oban release.
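A lighter variant of that idea is to delegate to the basic engine rather than copy it, wrapping only the fetch with retries. This is a sketch, not a drop-in implementation: the exact `Oban.Engine` callback set varies between releases, so only `fetch_jobs/3` is shown and the remaining callbacks would be delegated to `Oban.Engines.Basic` unchanged.

```elixir
defmodule MyApp.RetryingEngine do
  # Sketch only: wraps Oban.Engines.Basic.fetch_jobs/3 with retries on
  # :serialization_failure. The Oban.Engine behaviour defines further
  # callbacks, which would be delegated to Basic in the same fashion.
  @behaviour Oban.Engine

  alias Oban.Engines.Basic

  @impl Oban.Engine
  def fetch_jobs(conf, meta, running) do
    retry(fn -> Basic.fetch_jobs(conf, meta, running) end, 10)
  end

  defp retry(fun, retries) do
    fun.()
  rescue
    ex in Postgrex.Error ->
      case ex do
        %Postgrex.Error{postgres: %{code: :serialization_failure}} when retries > 0 ->
          # Back off briefly with jitter, then retry the fetch.
          Process.sleep(5 + :rand.uniform(10))
          retry(fun, retries - 1)

        _ ->
          reraise(ex, __STACKTRACE__)
      end
  end
end
```

The custom engine would then be selected via the `engine:` option in the Oban configuration, but it still inherits the maintenance burden of tracking the behaviour across releases.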

Any ideas for an alternative fix for our problem? Is this something that could be added to the `Oban.Repo.transaction` function, for example?

The behaviour we want is for Oban to use a similar retry strategy when executing its internal database queries, such as those in the Oban.Engines.Basic module.

This is something Pro's Smart engine already does (admittedly, it's not a differentiating feature).

Any ideas for an alternative fix for our problem? Is this something that could be added to the `Oban.Repo.transaction` function, for example?

Possibly. We'll look into it.

This is something Pro's Smart engine already does (admittedly, it's not a differentiating feature).

Thanks for the quick reply. Would upgrading to Oban Pro likely resolve this issue for us?

Would you accept a Pull Request which implements the retry logic in the Oban.Repo.transaction function?

Would upgrading to Oban Pro likely resolve this issue for us?

Possibly, but the generic backoff period might be too slow. It has a quick backoff mechanism for locks, but nothing specific to `:serialization_failure`.

Would you accept a Pull Request which implements the retry logic in the Oban.Repo.transaction function?

Not necessary. I'll take care of it.

@slashdotdash #1134 is up if you'd like to take a look

Looks good to me, thank you for the rapid turnaround. Impressive work!