oban-bg/oban

Oban Pro Batch Workers stuck in "executing" forever after upgrade from Aurora PostgreSQL 14 -> 16

Closed this issue · 4 comments

Precheck

  • Do a quick search and make sure the bug has not yet been reported
    • a similar bug was reported and fixed, but I believe this is separate (but related)

Environment

  • Oban Version
    • Oban 2.17.12
    • Oban Pro 1.4.10
    • Oban Web 2.10.4
  • PostgreSQL Version
    • Aurora PostgreSQL 16.2 (breaks)
    • Aurora PostgreSQL 14.10 (works)
  • Elixir & Erlang/OTP Versions (elixir --version)
    • Elixir 1.16.1
    • Erlang/OTP 25

Current Behavior

Inserting a batch worker after upgrading to Aurora PSQL 16.2 causes the job to be stuck in "executing" for a prolonged period of time, never completing or failing, and eventually getting stuck in the "scheduled" state. No errors are recorded for the job, the attempt is shown as attempt 1 of 5.

Additionally, after inserting the batch worker, other jobs (non-batch) in that queue will also get stuck in the "available" state, never actually executing.

When looking at error logs, we see repeated instances of the same error:

GenServer {Oban.Registry, {Oban, {:producer, "default"}}} terminating
** (Postgrex.Error) ERROR 42P18 (indeterminate_datatype) could not determine data type of parameter $7

    query: SELECT $1 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $2) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $3) UNION ALL (SELECT $4 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $5) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $6) LIMIT 1) UNION ALL (SELECT $7 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $8) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $9) LIMIT 1) UNION ALL (SELECT $10 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $11) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $12) LIMIT 1) UNION ALL (SELECT $13 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $14) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $15) LIMIT 1) UNION ALL (SELECT $16 FROM "public"."oban_jobs" AS o0 WHERE (o0."meta" ? 'batch_id') AND (o0."meta" ->> 'batch_id' = $17) AND (o0."meta" ->> 'callback' IS NULL) AND (o0."state" = $18) LIMIT 1) LIMIT 1
    (ecto_sql 3.11.3) lib/ecto/adapters/sql.ex:1054: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.11.3) lib/ecto/adapters/sql.ex:952: Ecto.Adapters.SQL.execute/6
    (ecto 3.11.2) lib/ecto/repo/queryable.ex:232: Ecto.Repo.Queryable.execute/4
    (ecto 3.11.2) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
    (oban 2.17.12) lib/oban/repo.ex:156: Oban.Repo.with_dynamic_repo/3
    (oban_pro 1.4.10) lib/oban/pro/batcher.ex:113: anonymous fn/3 in Oban.Pro.Batcher.states_for_callbacks/3
    (ecto_sql 3.11.3) lib/ecto/adapters/sql.ex:1358: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.6.0) lib/db_connection.ex:1020: DBConnection.transaction/3
    (oban 2.17.12) lib/oban/repo.ex:156: Oban.Repo.with_dynamic_repo/3
    (oban_pro 1.4.10) lib/oban/pro/batcher.ex:64: Oban.Pro.Batcher.handle/3
    (elixir 1.16.1) lib/enum.ex:987: Enum."-each/2-lists^foreach/1-0-"/2
    (oban_pro 1.4.10) lib/oban/pro/engines/smart.ex:861: Oban.Pro.Engines.Smart.run_flush_handlers/2
    (ecto 3.11.2) lib/ecto/multi.ex:883: Ecto.Multi.apply_operation/5
    (elixir 1.16.1) lib/enum.ex:2528: Enum."-reduce/3-lists^foldl/2-0-"/3
    (ecto 3.11.2) lib/ecto/multi.ex:856: anonymous fn/5 in Ecto.Multi.apply_operations/5
    (ecto_sql 3.11.3) lib/ecto/adapters/sql.ex:1358: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.6.0) lib/db_connection.ex:1710: DBConnection.run_transaction/4
    (ecto 3.11.2) lib/ecto/repo/transaction.ex:18: Ecto.Repo.Transaction.transaction/4
    (oban 2.17.12) lib/oban/repo.ex:156: Oban.Repo.with_dynamic_repo/3
    (oban_pro 1.4.10) lib/oban/pro/engines/smart.ex:536: anonymous fn/4 in Oban.Pro.Engines.Smart.fetch_jobs/3

This is a very similar error as this previous post in the Elixir/Oban forums: https://elixirforum.com/t/error-when-enqueuing-oban-jobs-with-unicity-constraint-on-aurora-postgresql-16-1-postgrex-error-error-42p18-indeterminate-datatype-could-not-determine-data-type-of-parameter-7/63996

We previously experienced that other bug after this 14->16 upgrade, raising the error immediately while inserting those jobs. That issue was resolved after the Oban Pro update to 1.4.10, but this issue has popped up after that update separately, and seems to happen in the background.

Expected Behavior

Inserting a batch worker should work, the worker should execute, the queue it was in should not form a logjam, and no errors from Oban internals should be thrown.

Something changed in how PostgreSQL 16.2 treats injected values in a union. This is fixed with explicit casting for all union queries in main, and we'll backport to v1.4

Glad to hear it's been fixed, and likely the whole thing will be improved in 1.5 👍 .

Is there a rough timeline on a 1.4 backport? I've been able to work around this for now, but it's been a tricky issue with batch workers in our dev environment.

@williamge We'll have a new v1.5.0 RC and a v1.4 patch released this week.

:) awesome thanks a lot!