Postgres serialization failure error when fetching jobs using `serializable` transaction isolation level
Closed this issue · 5 comments
Environment
- Oban Version:
2.18.0
- PostgreSQL Version:
15.5
- Erlang/OTP Version:
27.0.1
- Elixir Version:
1.17.1-otp-27
Current Behavior
Running Oban in a deployed environment with two nodes (non-distributed Erlang), we encountered the following Postgres :serialization_failure exception, reported by Sentry:
%Postgrex.Error{
  message: nil,
  postgres: %{
    code: :serialization_failure,
    line: "227",
    message: "could not serialize access due to concurrent update",
    file: "nodeLockRows.c",
    unknown: "ERROR",
    severity: "ERROR",
    pg_code: "40001",
    routine: "ExecLockRows"
  },
  connection_id: 25779,
  query: nil
}
The stack trace shows the error originates from the Oban queue producer when it calls the Oban.Engines.Basic.fetch_jobs/3 function:
[
{Ecto.Adapters.SQL, :raise_sql_call_error, 1,
[file: ~c"lib/ecto/adapters/sql.ex", line: 1054, error_info: %{module: Exception}]},
{Ecto.Adapters.SQL, :execute, 6, [file: ~c"lib/ecto/adapters/sql.ex", line: 952]},
{Ecto.Repo.Queryable, :execute, 4, [file: ~c"lib/ecto/repo/queryable.ex", line: 232]},
{Oban.Engines.Basic, :"-fetch_jobs/3-fun-0-", 4,
[file: ~c"lib/oban/engines/basic.ex", line: 119]},
{Ecto.Adapters.SQL, :"-checkout_or_transaction/4-fun-0-", 3,
[file: ~c"lib/ecto/adapters/sql.ex", line: 1358]},
{DBConnection, :run_transaction, 4, [file: ~c"lib/db_connection.ex", line: 1756]},
{Oban.Engine, :"-fetch_jobs/3-fun-0-", 4, [file: ~c"lib/oban/engine.ex", line: 234]},
{Oban.Engine, :"-with_span/4-fun-0-", 3, [file: ~c"lib/oban/engine.ex", line: 339]},
{:telemetry, :span, 3, [file: ~c"/app/deps/telemetry/src/telemetry.erl", line: 321]},
{Oban.Queue.Producer, :start_jobs, 1, [file: ~c"lib/oban/queue/producer.ex", line: 253]},
{Oban.Queue.Producer, :"-dispatch/1-fun-0-", 2,
[file: ~c"lib/oban/queue/producer.ex", line: 244]},
{:telemetry, :span, 3, [file: ~c"/app/deps/telemetry/src/telemetry.erl", line: 321]},
{Oban.Queue.Producer, :dispatch, 1, [file: ~c"lib/oban/queue/producer.ex", line: 243]},
{Oban.Queue.Producer, :handle_info, 2, [file: ~c"lib/oban/queue/producer.ex", line: 170]},
{:gen_server, :try_handle_info, 3, [file: ~c"gen_server.erl", line: 2173]},
{:gen_server, :handle_msg, 6, [file: ~c"gen_server.erl", line: 2261]},
{:proc_lib, :init_p_do_apply, 3, [file: ~c"proc_lib.erl", line: 329]}
]
Postgres Configuration
We have set Postgres' default transaction isolation level to serializable, the strictest isolation level:
ALTER DATABASE <database> SET default_transaction_isolation = 'serializable';
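As a quick sanity check, the effective default can be confirmed from the application itself, since new connections pick up the database-level setting. This is a sketch assuming an Ecto repo module named `Repo` (an assumption, not from the issue):

```elixir
# Sketch: verify the database default from within the app.
# `Repo` is a hypothetical Ecto repo module name.
{:ok, %{rows: [[level]]}} = Repo.query("SHOW default_transaction_isolation")

# After the ALTER DATABASE above, `level` is expected to be "serializable"
IO.inspect(level)
```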
The Postgres documentation also advises that applications using serializable transactions automatically retry any transaction rolled back with the :serialization_failure error (pg_code "40001"), as we encountered above:
If the Serializable transaction isolation level is used for all writes and for all reads which need a consistent view of the data, no other effort is required to ensure consistency. Software from other environments which is written to use serializable transactions to ensure consistency should “just work” in this regard in PostgreSQL.
When using this technique, it will avoid creating an unnecessary burden for application programmers if the application software goes through a framework which automatically retries transactions which are rolled back with a serialization failure. It may be a good idea to set default_transaction_isolation to serializable. It would also be wise to take some action to ensure that no other transaction isolation level is used, either inadvertently or to subvert integrity checks, through checks of the transaction isolation level in triggers.
Within our application we execute all database updates through an automatic retry mechanism: a helper function that wraps Ecto's Repo.transaction function with exception handling to detect :serialization_failure and retry the transaction:
def transaction_with_retries(fun_or_multi, delay \\ 5, retries \\ 10) do
  Repo.transaction(fun_or_multi)
rescue
  ex in Postgrex.Error ->
    # Limit retries to prevent infinite attempts
    if retries <= 0, do: reraise(ex, __STACKTRACE__)

    case ex do
      %Postgrex.Error{postgres: %{code: :serialization_failure, pg_code: "40001"}} ->
        # Retry the transaction after a randomized (jittered) delay
        Process.sleep(delay)
        transaction_with_retries(fun_or_multi, delay + :rand.uniform(delay), retries - 1)

      %Postgrex.Error{} ->
        reraise(ex, __STACKTRACE__)
    end
end
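For illustration, the helper takes the same argument as Repo.transaction/1, so it accepts either a function or an Ecto.Multi. The body below is hypothetical (the `User` schema, `user_id`, and balance update are not from the issue):

```elixir
# Hypothetical usage of transaction_with_retries/3; the schema and
# update shown here are illustrative only.
result =
  transaction_with_retries(fn ->
    user = Repo.get!(User, user_id)

    user
    |> Ecto.Changeset.change(balance: user.balance + 100)
    |> Repo.update!()
  end)

case result do
  {:ok, user} -> user
  {:error, reason} -> raise "transaction failed: #{inspect(reason)}"
end
```

Because the retry happens by re-invoking the whole function, the transaction body must be safe to run more than once.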
In practice, this eliminates nearly all Postgres serialization failures in our own application code.
Expected Behavior
The behaviour we want is for Oban to use a similar retry strategy when executing its internal database queries, such as those in the Oban.Engines.Basic module.
One possible solution would be to implement our own engine module as a direct copy of the existing Oban.Engines.Basic module, with its Oban.Repo.transaction/2 calls replaced by a retry function such as transaction_with_retries/3 above. The drawback is that the basic engine contains a lot of important logic that would need to be kept in sync with each Oban release.
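To make that copy-and-wrap idea concrete, here is a minimal sketch of what such an engine might look like. It is not working code: only fetch_jobs/3 is shown, the remaining Oban.Engine callbacks would still need to delegate to Oban.Engines.Basic, and the exact callback signatures are assumptions that may vary across Oban versions:

```elixir
defmodule MyApp.RetryingEngine do
  @moduledoc """
  Hypothetical engine sketch: retries fetch_jobs/3 on serialization
  failures and delegates everything else to Oban.Engines.Basic.
  """

  alias Oban.Engines.Basic

  # ...delegate the remaining Oban.Engine callbacks to Basic here...

  def fetch_jobs(conf, meta, running), do: fetch_with_retries(conf, meta, running, 10)

  defp fetch_with_retries(conf, meta, running, retries) do
    Basic.fetch_jobs(conf, meta, running)
  rescue
    ex in Postgrex.Error ->
      case ex do
        %Postgrex.Error{postgres: %{code: :serialization_failure}} when retries > 0 ->
          # Brief randomized pause before retrying the fetch
          Process.sleep(5 + :rand.uniform(10))
          fetch_with_retries(conf, meta, running, retries - 1)

        _ ->
          reraise(ex, __STACKTRACE__)
      end
  end
end
```

Even as a sketch, this shows why maintaining a forked engine is unattractive: every non-wrapped callback is a copy that can drift from upstream.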
Do you have any ideas for an alternative fix for our problem? Is this something that could be added to the Oban.Repo.transaction function, for example?
The behaviour we want is for Oban to use a similar retry strategy when executing its internal database queries, such as those in the Oban.Engines.Basic module.
This is something Pro's Smart engine already does (admittedly, it's not a differentiating feature).
Do you have any ideas for an alternative fix for our problem? Is this something that could be added to the Oban.Repo.transaction function, for example?
Possibly. We'll look into it.
This is something Pro's Smart engine already does (admittedly, it's not a differentiating feature).
Thanks for the quick reply. Would upgrading to Oban Pro likely resolve this issue for us?
Would you accept a Pull Request which implements the retry logic in the Oban.Repo.transaction function?
Would upgrading to Oban Pro likely resolve this issue for us?
Possibly, though the generic backoff period might be too slow. It has a quick backoff mechanism for locks, but nothing specifically for :serialization_failure.
Would you accept a Pull Request which implements the retry logic in the Oban.Repo.transaction function?
Not necessary. I'll take care of it.
@slashdotdash #1134 is up if you'd like to take a look
Looks good to me, thank you for the rapid turnaround. Impressive work!