sorentwo/oban

Race condition when inserting unique jobs

Closed this issue · 3 comments

Precheck

  • Do a quick search and make sure the bug has not yet been reported
  • For support, favor using the Elixir Forum, Slack, IRC, etc.
  • Be friendly and polite!

While this originated from a support conversation on my day job, this is slightly unrelated and I also didn't found anything on the docs about this caveat nor any issues about it.

Environment

  • Oban Version: 2.17.10
  • PostgreSQL Version: 16
  • Elixir & Erlang/OTP Versions (elixir --version): Elixir 1.16.2 (compiled with Erlang/OTP 26)

Current Behavior

If I run the insert_many_unique function below

defmodule UniqueObanIssue.Worker do
  use Oban.Worker

  alias UniqueObanIssue.Repo

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"name" => name}}) do
    IO.puts("Hello #{name}")
    :ok
  end

  def insert_many_unique(name \\ "Bernardo") do
    1..5
    |> Task.async_stream(
      fn num ->
        if num == 5 do
          Process.sleep(500)
        end

        Repo.transaction(fn ->
          result = insert_unique(name)
          Process.sleep(2000)

          if num != 5 do
            Repo.rollback(result)
          else
            result
          end
        end)
      end,
      max_concurrency: 5
    )
    |> Enum.to_list()
  end

  def insert_unique(name \\ "Bernardo") do
    UniqueObanIssue.Worker.new(
      %{name: name},
      unique: [states: [:scheduled, :available]]
    )
    |> Oban.insert!()
  end
end

No jobs are inserted or executed.

Expected Behavior

The last job should have been inserted and executed.

Extra Information

While reading the code for the basic engine I saw that when the advisory lock is locked we just return Changeset.apply_changes(changeset, :insert) and suppress the error.

So this means that if an earlier transaction acquired the lock but eventually rollbacks for any reason, a transaction succeed but that tried to acquire the lock later will be "successful" but no jobs would be inserted because we just return the changeset with applied changes.

A workaround at the application code level would be to check on the job.id and either retry or bubble up the error. However, I didn't found any info about that on the docs either, so an unaware user would believe everything went well but in fact no job was scheduled.

I created a repo reproducing this behavior at https://github.com/bamorim/unique_oban_issue/tree/main

Thanks! ❤️

Nice catch! Jobs have a conflict? field that's supposed to be true in this situation. I've pushed a patch to fix this so the rollback discrepancy you noted can be detected.

But what one should do then? If conflict?: true one would expect to be a job scheduled already, but it might never get inserted. Retrying might cause an infinite number of retries because conflict? Could be caused by a legit inserted and committed job. Maybe we need a different flag for catching that case?

Maybe we need a different flag for catching that case?

I'm hesitant to add another flag to the struct to handle an advisory lock conflict, but it would at least make it clear that the conflict wasn't necessarily because another job existed. However, you'd still need to be selective about retrying to avoid infinite retries from separate connections.