sorentwo/oban

Seeing High CPU Usage with Oban Table Partitioning

Precheck

We see high CPU usage after about 24 hours of using table partitioning, which we did not see with the non-partitioned setup.

[Image: AWS RDS chart]

[Image: AWS chart]

Configs:

config :foo, Oban,
  prefix: "foo",
  engine: Oban.Pro.Engines.Smart,
  repo: Foo.Repo,
  plugins: [
    Oban.Pro.Plugins.DynamicLifeline,
    {
      Oban.Pro.Plugins.DynamicPartitioner,
      # We updated all retentions to 1 day (two hours ago); "cancelled" and "discarded" were previously 7.
      retention: [completed: 1, cancelled: 1, discarded: 1], buffer: 2, schedule: "0 11 * * *"
    },
    # We removed the Reindexer for the partitioned tables two hours ago (as of this writing), assuming it is no longer needed?
    # { Oban.Plugins.Reindexer, schedule: "0 9 * * *"},
    {
      Oban.Plugins.Cron,
      crontab: [
        # ... some crontab entries omitted
      ]
    }
  ],
  queues: [default: 50, cronjobs: 1]

Tables:

foo=> \dt foo.*
                         List of relations
 Schema |             Name             |       Type        | Owner
--------+------------------------------+-------------------+-------
 foo    | oban_jobs                    | partitioned table | admin
 foo    | oban_jobs_available          | table             | admin
 foo    | oban_jobs_cancelled          | partitioned table | admin
 foo    | oban_jobs_cancelled_20240207 | table             | admin
 foo    | oban_jobs_cancelled_20240208 | table             | admin
 foo    | oban_jobs_cancelled_20240209 | table             | admin
 foo    | oban_jobs_cancelled_20240210 | table             | admin
 foo    | oban_jobs_completed          | partitioned table | admin
 foo    | oban_jobs_completed_20240207 | table             | admin
 foo    | oban_jobs_completed_20240208 | table             | admin
 foo    | oban_jobs_completed_20240209 | table             | admin
 foo    | oban_jobs_completed_20240210 | table             | admin
 foo    | oban_jobs_discarded          | partitioned table | admin
 foo    | oban_jobs_discarded_20240207 | table             | admin
 foo    | oban_jobs_discarded_20240208 | table             | admin
 foo    | oban_jobs_discarded_20240209 | table             | admin
 foo    | oban_jobs_discarded_20240210 | table             | admin
 foo    | oban_jobs_retryable          | table             | admin
 foo    | oban_jobs_scheduled          | table             | admin
 foo    | oban_peers                   | table             | admin
 foo    | oban_producers               | table             | admin
(21 rows)

The job insertion rate per minute is as follows:

foo=> SELECT
  date_trunc('minute', inserted_at) as minute_group,
  COUNT(*) as row_count
FROM foo.oban_jobs
where inserted_at > '2024-02-07 05:00:00'
GROUP BY minute_group
ORDER BY minute_group;
    minute_group     | row_count
---------------------+-----------
 2024-02-07 05:00:00 |      1480
 2024-02-07 05:01:00 |      2405
 2024-02-07 05:02:00 |      2236
 2024-02-07 05:03:00 |      1894
 2024-02-07 05:04:00 |      1853
 2024-02-07 05:05:00 |      1876
 2024-02-07 05:06:00 |      2588
 2024-02-07 05:07:00 |      3017
 2024-02-07 05:08:00 |      2715
 2024-02-07 05:09:00 |      2657
 2024-02-07 05:10:00 |      2479
 2024-02-07 05:11:00 |      1219
(12 rows)

Jobs get stuck in the available state for some time:

foo=> select state, sum(1) from foo.oban_jobs group by state order by sum;
   state   |   sum
-----------+---------
 discarded |      92
 executing |     101
 available |    4709  <-- stuck
 completed | 1513041
 scheduled | 1684103
(5 rows)
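
For reference, a rough way to quantify how long available jobs have been waiting (an illustrative Ecto sketch, not something we actually ran; it assumes Foo.Repo and the "foo" prefix from our config above):

# Illustrative diagnostic: how long has the oldest `available` job been
# waiting to run?
import Ecto.Query

Foo.Repo.one(
  from j in Oban.Job,
    prefix: "foo",
    where: j.state == "available",
    select: fragment("now() - min(?)", j.scheduled_at)
)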

Environment

  • oban 2.17.3, oban_pro 1.3.3
  • PostgreSQL 14.7
    • AWS RDS instance: db.m6g.2xlarge
  • elixir 1.15.6-otp-25
  • erlang 25.3.2.6

Question

Any ideas on the report above? We've seen CPU usage decreasing (see the graph above), but we're not sure whether it will return to "normal" or not.

Thank you!

We also get errors like the following. Is the DynamicLifeline plugin compatible with partitioning, and can we safely ignore such errors?

GenServer {Oban.Registry, {Oban, {:plugin, Oban.Pro.Plugins.DynamicLifeline}}} terminating
** (Postgrex.Error) ERROR 40001 (serialization_failure) tuple to be locked was already moved to another partition due to concurrent update
    (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:1047: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:945: Ecto.Adapters.SQL.execute/6
    (ecto 3.10.3) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
    (oban_pro 1.3.3) lib/oban/pro/plugins/dynamic_lifeline.ex:191: Oban.Pro.Plugins.DynamicLifeline.rescue_orphaned/1
    (oban_pro 1.3.3) lib/oban/pro/plugins/dynamic_lifeline.ex:155: anonymous fn/2 in Oban.Pro.Plugins.DynamicLifeline.handle_info/2
    (telemetry 1.2.1) /opt/app/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    (oban_pro 1.3.3) lib/oban/pro/plugins/dynamic_lifeline.ex:152: Oban.Pro.Plugins.DynamicLifeline.handle_info/2
    (stdlib 4.3.1.2) gen_server.erl:1123: :gen_server.try_dispatch/4
Last message: :rescue

There are numerous fixes for partitioning in the Pro v1.3.4 release from yesterday. Upgrading should alleviate the CPU issues you're reporting.

For the DynamicLifeline issue, this is the first report I've seen of that, and I expect that improving query performance will prevent it as well.

I can report that I'm seeing this error as well. It only occurs on one of my nodes, but it happens several times an hour. I'm also seeing a similar number of jobs getting stuck in the executing state; they stay there until I manually cancel them, and the timeout feature doesn't work for these orphans.

Happy to be a guinea pig if there's anything you'd like me to try.
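
For what it's worth, cancelling these manually is just something like the standard cancel call from an attached console (the job id here is made up):

# Manually cancelling one stuck job by id from an IEx console; the id is
# illustrative.
Oban.cancel_job(123_456)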

06:58:27.184 [error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Pro.Plugins.DynamicLifeline}}} terminating
** (Postgrex.Error) ERROR 40001 (serialization_failure) tuple to be locked was already moved to another partition due to concurrent update
    (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:1047: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:945: Ecto.Adapters.SQL.execute/6
    (ecto 3.10.3) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
    (oban_pro 1.3.4) lib/oban/pro/plugins/dynamic_lifeline.ex:192: Oban.Pro.Plugins.DynamicLifeline.rescue_orphaned/1
    (oban_pro 1.3.4) lib/oban/pro/plugins/dynamic_lifeline.ex:155: anonymous fn/2 in Oban.Pro.Plugins.DynamicLifeline.handle_info/2
    (telemetry 1.2.1) /home/ubuntu/edeliver/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    (oban_pro 1.3.4) lib/oban/pro/plugins/dynamic_lifeline.ex:152: Oban.Pro.Plugins.DynamicLifeline.handle_info/2
    (stdlib 3.17.2) gen_server.erl:695: :gen_server.try_dispatch/4
Last message: :rescue

@darksheik are you still seeing CPU issues after upgrading to Pro v1.3.4?

No, everything has been running very nicely in that regard since the upgrade, thank you! These occasional orphans seem to be the only hiccup remaining.

Just curious - if we had to build our own reaper for these jobs while a fix is pending, is there a specific query we might use to determine whether a job is orphaned? We could probably enforce the timeout so a job doesn't run past its configured time, but ideally we'd cancel it as soon as we know.

@sorentwo Our runaway jobs continue to be an issue. Is this something you're actively investigating, and would you prefer I open a new issue to isolate it from the indexes?

We've built our own reaper for jobs that seem like they're running too long, but we might revert the partitioned tables if the fix doesn't look easy.
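
For anyone curious, the general idea is something like the sketch below. It's simplified and illustrative rather than our exact code; the 30-minute cutoff is an arbitrary assumption, and it relies on Oban.cancel_all_jobs accepting an Ecto queryable:

# Illustrative sketch: periodically cancel anything that has been in the
# `executing` state longer than a fixed cutoff (30 minutes here).
import Ecto.Query

cutoff = DateTime.add(DateTime.utc_now(), -30 * 60, :second)

Oban.Job
|> where([j], j.state == "executing" and j.attempted_at < ^cutoff)
|> Oban.cancel_all_jobs()

Ideally the cutoff would match the longest expected runtime per queue rather than being a single global value.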

@darksheik This is something I'm actively investigating and it's my highest priority. Unfortunately, I'm unable to recreate the problem synthetically and rarely see it in production myself. In the past several days I've explored many different possibilities and am working hard to diagnose the issue.

is there a specific query we might use to determine whether a job is orphaned?

The jobs don't appear orphaned because their node and queue producer are still alive. There's nothing other than elapsed time to indicate that a job has run too long.

We've built our own reaper for jobs that seem like they're running too long, but we might revert the partitioned tables if the fix doesn't look easy.

Good to hear you built your own reaper job; that's a fine approach. The job is actually finished by the time you see it "stuck". This can happen with regular non-partitioned tables as well, so there's no benefit in reverting partitioning.

Sorry, I missed your reply last week... Do you not think the crash coming out of Lifeline above, which mentions partitions, is related to the issue?

tuple to be locked was already moved to another partition due to concurrent update

The stuck job issue is fixed in Pro v1.3.5 and Oban v2.17.4.

Do you not think the crash coming out of Lifeline above, which mentions partitions, is related to the issue?

Crashes or exceptions of any kind contributed to the problem. In brief, ack operations could be lost when writing the update to the database failed. The Pro release docs and changelogs above have more details about the issue.

Thanks a ton! I'll read up and we'll try it out very soon.

@sorentwo Thanks again - this latest patch appears to have addressed the issue and we've had it in production for a couple of days.