Race condition when inserting unique jobs
bamorim opened this issue · 3 comments
Precheck
- Do a quick search and make sure the bug has not yet been reported
- For support, favor using the Elixir Forum, Slack, IRC, etc.
- Be friendly and polite!
While this originated from a support conversation on my day job, this is slightly unrelated and I also didn't found anything on the docs about this caveat nor any issues about it.
Environment
- Oban Version:
2.17.10
- PostgreSQL Version:
16
- Elixir & Erlang/OTP Versions (
elixir --version
):Elixir 1.16.2 (compiled with Erlang/OTP 26)
Current Behavior
If I run the insert_many_unique
function below
defmodule UniqueObanIssue.Worker do
use Oban.Worker
alias UniqueObanIssue.Repo
@impl Oban.Worker
def perform(%Oban.Job{args: %{"name" => name}}) do
IO.puts("Hello #{name}")
:ok
end
def insert_many_unique(name \\ "Bernardo") do
1..5
|> Task.async_stream(
fn num ->
if num == 5 do
Process.sleep(500)
end
Repo.transaction(fn ->
result = insert_unique(name)
Process.sleep(2000)
if num != 5 do
Repo.rollback(result)
else
result
end
end)
end,
max_concurrency: 5
)
|> Enum.to_list()
end
def insert_unique(name \\ "Bernardo") do
UniqueObanIssue.Worker.new(
%{name: name},
unique: [states: [:scheduled, :available]]
)
|> Oban.insert!()
end
end
No jobs are inserted or executed.
Expected Behavior
The last job should have been inserted and executed.
Extra Information
While reading the code for the basic engine I saw that when the advisory lock is locked we just return Changeset.apply_changes(changeset, :insert)
and suppress the error.
So this means that if an earlier transaction acquired the lock but eventually rollbacks for any reason, a transaction succeed but that tried to acquire the lock later will be "successful" but no jobs would be inserted because we just return the changeset with applied changes.
A workaround at the application code level would be to check on the job.id
and either retry or bubble up the error. However, I didn't found any info about that on the docs either, so an unaware user would believe everything went well but in fact no job was scheduled.
I created a repo reproducing this behavior at https://github.com/bamorim/unique_oban_issue/tree/main
Thanks! ❤️
Nice catch! Jobs have a conflict?
field that's supposed to be true
in this situation. I've pushed a patch to fix this so the rollback discrepancy you noted can be detected.
But what one should do then? If conflict?: true one would expect to be a job scheduled already, but it might never get inserted. Retrying might cause an infinite number of retries because conflict? Could be caused by a legit inserted and committed job. Maybe we need a different flag for catching that case?
Maybe we need a different flag for catching that case?
I'm hesitant to add another flag to the struct to handle an advisory lock conflict, but it would at least make it clear that the conflict wasn't necessarily because another job existed. However, you'd still need to be selective about retrying to avoid infinite retries from separate connections.