sorentwo/oban

Oban Pro v1.3.1 Workflows jobs not moving to executing state

Closed this issue · 16 comments

Environment

  • Oban Version: v2.17.2
  • PostgreSQL Version: v15.2
  • Elixir & Erlang/OTP Versions (elixir --version): Elixir 1.16.0 (compiled with Erlang/OTP 26)

Current Behavior

Hi! After upgrading to the latest Oban Pro v1.3.1, when using Oban.Pro.Workers.Workflow with ack_async: false the jobs stay in the available state and hang there.

Running the same setup without ack_async: false, the jobs move from the available state to the executing state correctly.

I added ack_async: false because I'm using recorded output in the jobs.
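
For reference, here's a minimal sketch of the setup described above. The module, app, repo, and queue names are hypothetical; only the Workflow worker, recorded output, and ack_async: false come from this report.

# Minimal sketch of the setup described above (module, app, and queue names
# are placeholders).
defmodule MyApp.WorkflowStep do
  use Oban.Pro.Workers.Workflow, queue: :workflows, recorded: true

  @impl true
  def process(%Oban.Job{args: args}) do
    # Recorded workers persist the return value so it can be read later.
    {:ok, args}
  end
end

# Queue configuration with synchronous acking, the combination that hangs:
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  repo: MyApp.Repo,
  queues: [workflows: [local_limit: 10, ack_async: false]]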

Expected Behavior

Being able to use Oban.Pro.Workers.Workflow with ack_async: false and have jobs execute normally.

vhf commented

We're observing the same thing with Chunk workers.

Do you have global limits or rate limits configured? Will you share more about your configuration?

vhf commented

In our case (please say so if you'd rather have me open another issue):

# Queue definition:
my_queue: [global_limit: 10, paused: true]

# Worker definition:
use Oban.Pro.Workers.Chunk,
  queue: :my_queue,
  size: 75,
  timeout: :timer.seconds(10),
  max_attempts: 10

my_queue is automatically unpaused after startup. What we're seeing is the unpaused queue accumulating hundreds of thousands of jobs in the "available" state with no jobs going through. Sometimes a burst of jobs goes through, then not a single one for more than 30 minutes, then another short burst, and so on. This is a single-node setup, so global_limit could just as well be local.
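
For completeness, a simplified sketch of the "automatically unpaused after startup" part, with a made-up task module around the standard Oban.resume_queue/1 call:

# Hypothetical startup task that resumes the queue once the app has booted;
# the real implementation may differ, Oban.resume_queue/1 is the key call.
defmodule MyApp.ResumeQueues do
  use Task, restart: :transient

  def start_link(_arg) do
    Task.start_link(__MODULE__, :run, [])
  end

  def run do
    Oban.resume_queue(queue: :my_queue)
  end
end

# Started in the supervision tree after Oban itself:
# children = [MyApp.Repo, {Oban, oban_opts}, MyApp.ResumeQueues]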

@vhf Thanks! That's very helpful.

The original issue mentions that they've explicitly set ack_async: false. Is that true in your case as well?

vhf commented

The original issue mentions that they've explicitly set ack_async: false. Is that true in your case as well?

It is not. All we did was upgrade oban_pro from 1.2.2 to 1.3.0 without any code or config change and ran into this problem. We downgraded to 1.2.2 (works as expected), upgraded from 1.2.2 to 1.3.1, ran into the problem again, and solved it by downgrading to 1.2.2 once more.

The original issue comes from the combination of a global limit and ack_async: false, not from workflows themselves.

@vhf Your issue comes from the combination of a globally limited queue and the Chunk worker. Fixing that one now.

@vhf I may have spoken too soon there; I can't recreate your issue so far. We're also running chunks in a globally limited queue on v1.3.1 and it's working as expected. Please reach out on Slack so I can gather additional details.

Hey @sorentwo, we do use a combination of global limit and ack_async: false - here's a sample config for a queue:

[
  ack_async: false,
  local_limit: 50,
  global_limit: [
    allowed: 10,
    partition: [fields: [:args], keys: [:some_key]]
  ],
  rate_limit: [allowed: 5_000, period: {1, :minute}]
]

I'm not sure whether there are any alternatives that make the Workflow work with the above config. I tried again and the same behaviour occurs: jobs stall in the available state.
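
For reference, that options list sits under the queue in our Oban config, roughly like this (app, repo, and queue names are placeholders; the option values are the ones above):

# Sketch of the queue options above in context; app, repo, and queue names
# are placeholders, the option values are the ones from this comment.
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  repo: MyApp.Repo,
  queues: [
    recorded_workflows: [
      ack_async: false,
      local_limit: 50,
      global_limit: [
        allowed: 10,
        partition: [fields: [:args], keys: [:some_key]]
      ],
      rate_limit: [allowed: 5_000, period: {1, :minute}]
    ]
  ]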

@omaralsoudanii The issue you encountered is fixed in v1.3.2. Thanks for the report!

Thanks for jumping into this quickly, @sorentwo!
I tested it, and the jobs are getting executed now. However, I noticed that the global limit partitioning doesn't work anymore.
In the example below, Oban Pro v1.2 executes 10 jobs at a time, while with Oban Pro v1.3 all the jobs execute at the same time 🤔

global_limit: [
  allowed: 10,
  partition: [fields: [:args], keys: [:some_key]]
],
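
To illustrate what the partitioning should enforce with that setting: at most 10 concurrently executing jobs per distinct args value of :some_key. A made-up example:

# With partitioning on :some_key, each distinct value ("a" and "b" below)
# should run at most 10 jobs concurrently. The worker name is made up.
jobs =
  for key <- ["a", "b"], n <- 1..100 do
    MyApp.PartitionedWorker.new(%{some_key: key, n: n})
  end

Oban.insert_all(jobs)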

@omaralsoudanii Is that also with ack_async: false? Side question, what prompted you to run with ack_async: false initially?

@sorentwo Yeah, it is also with ack_async: false. This is the full config for the queue :)

[
  ack_async: false,
  local_limit: 50,
  global_limit: [
    allowed: 10,
    partition: [fields: [:args], keys: [:some_key]]
  ],
  rate_limit: [allowed: 5_000, period: {1, :minute}]
]

Side question, what prompted you to run with ack_async: false initially?

I'm using the recorded jobs feature in Oban. As soon as the job finishes executing, I retrieve the data via the Worker hooks. This is not possible with the newly introduced async tracking due to the slight lag documented here: https://getoban.pro/docs/pro/1.3.2/Oban.Pro.Engines.Smart.html#module-async-tracking

Edit: Scratch that. The recording is removed before the hook fires. It should be available without pulling it back from the database though.

@sorentwo The issue I'm noticing now is that the global partition limiting doesn't work. In the example I sent, Oban Pro v1.2 executes at most 10 jobs in the queue at a time; right now all the jobs execute at the same time with no limit applied.

@omaralsoudanii v1.3.3 is out after extensive testing, with sync acking overhauled to force serialized updates.

In addition, there's a new after_process/3 callback so you can get the return value in the hook without fetching. Now you don't need to use ack_async: false 🙂

def after_process(state, job, result) do
  ...
end
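
For example, a recorded workflow worker could consume the result directly in the hook. This is a hedged sketch: the module name and MyApp.Results.store/2 are made up, and the exact shapes of state and result should be checked against the v1.3.3 docs.

defmodule MyApp.RecordedStep do
  use Oban.Pro.Workers.Workflow, queue: :workflows, recorded: true

  @impl true
  def process(%Oban.Job{args: args}) do
    {:ok, do_work(args)}
  end

  # The third argument carries the job's return value, so the recorded
  # output no longer needs ack_async: false or a database round trip.
  def after_process(_state, %Oban.Job{id: id}, result) do
    MyApp.Results.store(id, result)
    :ok
  end

  defp do_work(args), do: Map.put(args, "processed", true)
end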

Thank you @sorentwo! It works perfectly now without the need to use ack_async: false. Love the addition of the new after_process/3 hook!