sorentwo/oban

Seeing High CPU Usage with Oban Table Partitioning

Precheck

We see high CPU usage after about 24 hours of using table partitioning, which we did not see with the non-partitioned setup.

[Image: AWS RDS chart]

[Image: AWS chart]

Configs:

config :foo, Oban,
  prefix: "foo",
  engine: Oban.Pro.Engines.Smart,
  repo: Foo.Repo,
  plugins: [
    Oban.Pro.Plugins.DynamicLifeline,
    {
      Oban.Pro.Plugins.DynamicPartitioner,
      # We updated all retentions to 1 day (two hours ago); "cancelled" and "discarded" were previously 7.
      retention: [completed: 1, cancelled: 1, discarded: 1], buffer: 2, schedule: "0 11 * * *"
    },
    # We removed the Reindexer for the partitioned tables two hours ago (as of this writing), assuming it is no longer needed?
    # { Oban.Plugins.Reindexer, schedule: "0 9 * * *"},
    {
      Oban.Plugins.Cron,
      crontab: [
        # ... some crontab entries omitted
      ]
    }
  ],
  queues: [default: 50, cronjobs: 1]

Tables:

foo=> \dt foo.*
                         List of relations
 Schema |             Name             |       Type        | Owner
--------+------------------------------+-------------------+-------
 foo    | oban_jobs                    | partitioned table | admin
 foo    | oban_jobs_available          | table             | admin
 foo    | oban_jobs_cancelled          | partitioned table | admin
 foo    | oban_jobs_cancelled_20240207 | table             | admin
 foo    | oban_jobs_cancelled_20240208 | table             | admin
 foo    | oban_jobs_cancelled_20240209 | table             | admin
 foo    | oban_jobs_cancelled_20240210 | table             | admin
 foo    | oban_jobs_completed          | partitioned table | admin
 foo    | oban_jobs_completed_20240207 | table             | admin
 foo    | oban_jobs_completed_20240208 | table             | admin
 foo    | oban_jobs_completed_20240209 | table             | admin
 foo    | oban_jobs_completed_20240210 | table             | admin
 foo    | oban_jobs_discarded          | partitioned table | admin
 foo    | oban_jobs_discarded_20240207 | table             | admin
 foo    | oban_jobs_discarded_20240208 | table             | admin
 foo    | oban_jobs_discarded_20240209 | table             | admin
 foo    | oban_jobs_discarded_20240210 | table             | admin
 foo    | oban_jobs_retryable          | table             | admin
 foo    | oban_jobs_scheduled          | table             | admin
 foo    | oban_peers                   | table             | admin
 foo    | oban_producers               | table             | admin
(21 rows)

The job insertion rate per minute is as follows:

foo=> SELECT
  date_trunc('minute', inserted_at) as minute_group,
  COUNT(*) as row_count
FROM foo.oban_jobs
where inserted_at > '2024-02-07 05:00:00'
GROUP BY minute_group
ORDER BY minute_group;
    minute_group     | row_count
---------------------+-----------
 2024-02-07 05:00:00 |      1480
 2024-02-07 05:01:00 |      2405
 2024-02-07 05:02:00 |      2236
 2024-02-07 05:03:00 |      1894
 2024-02-07 05:04:00 |      1853
 2024-02-07 05:05:00 |      1876
 2024-02-07 05:06:00 |      2588
 2024-02-07 05:07:00 |      3017
 2024-02-07 05:08:00 |      2715
 2024-02-07 05:09:00 |      2657
 2024-02-07 05:10:00 |      2479
 2024-02-07 05:11:00 |      1219
(12 rows)

Jobs get stuck in the available state for some time:

foo=> select state, sum(1) from foo.oban_jobs group by state order by sum;
   state   |   sum
-----------+---------
 discarded |      92
 executing |     101
 available |    4709  <-- stuck
 completed | 1513041
 scheduled | 1684103
(5 rows)
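
For reference, a rough way to quantify how long available jobs have been waiting (an illustrative Ecto sketch, not something we actually ran; it assumes Foo.Repo and the "foo" prefix from our config above):

# Illustrative diagnostic: how long has the oldest `available` job been
# waiting to run?
import Ecto.Query

Foo.Repo.one(
  from j in Oban.Job,
    prefix: "foo",
    where: j.state == "available",
    select: fragment("now() - min(?)", j.scheduled_at)
)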

Environment

  • oban 2.17.3, oban_pro 1.3.3
  • PostgreSQL 14.7
    • AWS RDS instance: db.m6g.2xlarge
  • elixir 1.15.6-otp-25
  • erlang 25.3.2.6

Question

Any ideas on the report above? We've seen CPU usage decreasing (see the graph above), but we're not sure whether it will return to "normal" or not.

Thank you!

We also get errors like the following. Is the DynamicLifeline plugin compatible with partitioning, and can we safely ignore such errors?

GenServer {Oban.Registry, {Oban, {:plugin, Oban.Pro.Plugins.DynamicLifeline}}} terminating
** (Postgrex.Error) ERROR 40001 (serialization_failure) tuple to be locked was already moved to another partition due to concurrent update
    (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:1047: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:945: Ecto.Adapters.SQL.execute/6
    (ecto 3.10.3) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
    (oban_pro 1.3.3) lib/oban/pro/plugins/dynamic_lifeline.ex:191: Oban.Pro.Plugins.DynamicLifeline.rescue_orphaned/1
    (oban_pro 1.3.3) lib/oban/pro/plugins/dynamic_lifeline.ex:155: anonymous fn/2 in Oban.Pro.Plugins.DynamicLifeline.handle_info/2
    (telemetry 1.2.1) /opt/app/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    (oban_pro 1.3.3) lib/oban/pro/plugins/dynamic_lifeline.ex:152: Oban.Pro.Plugins.DynamicLifeline.handle_info/2
    (stdlib 4.3.1.2) gen_server.erl:1123: :gen_server.try_dispatch/4
Last message: :rescue

There are numerous fixes for partitioning in the Pro v1.3.4 release from yesterday. Upgrading should alleviate the CPU issues you're reporting.

For the DynamicLifeline issue, this is the first report I've seen of that, and I expect that improving query performance will prevent it as well.

I can report that I'm seeing this error as well. It only occurs on one of my nodes, but it happens several times an hour. I'm also seeing a similar number of jobs getting stuck in the executing state; they stay there until I manually cancel them, and the timeout feature doesn't work for these orphans.

Happy to be a guinea pig if there's anything you'd like me to try.
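
For what it's worth, cancelling these manually is just something like the standard cancel call from an attached console (the job id here is made up):

# Manually cancelling one stuck job by id from an IEx console; the id is
# illustrative.
Oban.cancel_job(123_456)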

06:58:27.184 [error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Pro.Plugins.DynamicLifeline}}} terminating
** (Postgrex.Error) ERROR 40001 (serialization_failure) tuple to be locked was already moved to another partition due to concurrent update
    (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:1047: Ecto.Adapters.SQL.raise_sql_call_error/1
    (ecto_sql 3.10.2) lib/ecto/adapters/sql.ex:945: Ecto.Adapters.SQL.execute/6
    (ecto 3.10.3) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
    (oban_pro 1.3.4) lib/oban/pro/plugins/dynamic_lifeline.ex:192: Oban.Pro.Plugins.DynamicLifeline.rescue_orphaned/1
    (oban_pro 1.3.4) lib/oban/pro/plugins/dynamic_lifeline.ex:155: anonymous fn/2 in Oban.Pro.Plugins.DynamicLifeline.handle_info/2
    (telemetry 1.2.1) /home/ubuntu/edeliver/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    (oban_pro 1.3.4) lib/oban/pro/plugins/dynamic_lifeline.ex:152: Oban.Pro.Plugins.DynamicLifeline.handle_info/2
    (stdlib 3.17.2) gen_server.erl:695: :gen_server.try_dispatch/4
Last message: :rescue

@darksheik are you still seeing CPU issues after upgrading to Pro v1.3.4?

No, everything has been running very nicely in that regard since the upgrade, thank you! These occasional orphans seem to be the only hiccup remaining.

Just curious - if we had to build our own reaper for these jobs while a fix is pending, is there a specific query we might use to determine whether a job is orphaned? We could probably enforce the timeout so a job doesn't run past its configured time, but ideally we'd cancel it as soon as we know.

@sorentwo Our runaway jobs continue to be an issue. Is this something you're actively investigating, and would you prefer I open a new issue to isolate it from the indexes?

We've built our own reaper for jobs that seem like they're running too long, but we might revert the partitioned tables if the fix doesn't look easy.
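
For anyone curious, the general idea is something like the sketch below. It's simplified and illustrative rather than our exact code; the 30-minute cutoff is an arbitrary assumption, and it relies on Oban.cancel_all_jobs accepting an Ecto queryable:

# Illustrative sketch: periodically cancel anything that has been in the
# `executing` state longer than a fixed cutoff (30 minutes here).
import Ecto.Query

cutoff = DateTime.add(DateTime.utc_now(), -30 * 60, :second)

Oban.Job
|> where([j], j.state == "executing" and j.attempted_at < ^cutoff)
|> Oban.cancel_all_jobs()

Ideally the cutoff would match the longest expected runtime per queue rather than being a single global value.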

@darksheik This is something I'm actively investigating and it's my highest priority. Unfortunately, I'm unable to recreate the problem synthetically and rarely see it in production myself. In the past several days I've explored many different possibilities and am working hard to diagnose the issue.

is there a specific query we might use to determine whether a job is orphaned?

The jobs don't appear orphaned because their node and queue producer are still alive. There's nothing other than elapsed time to indicate that a job has run too long.

We've built our own reaper for jobs that seem like they're running too long, but we might revert the partitioned tables if the fix doesn't look easy.

Good to hear you built your own reaper job; that's a fine approach. The job is actually finished by the time you see it "stuck". This can happen with regular non-partitioned tables as well, so there's no benefit in reverting partitioning.

Sorry, I missed your reply last week... Do you not think the crash coming out of Lifeline above, which mentions partitions, is related to the issue?

tuple to be locked was already moved to another partition due to concurrent update

The stuck job issue is fixed in Pro v1.3.5 and Oban v2.17.4.

Do you not think the crash coming out of Lifeline above, which mentions partitions, is related to the issue?

Crashes or exceptions of any kind contributed to the problem. In brief, ack operations could be lost when writing the update to the database failed. The Pro release docs and changelogs above have more details about the issue.

Thanks a ton! I'll read up and we'll try it out very soon.

@sorentwo Thanks again - this latest patch appears to have addressed the issue and we've had it in production for a couple of days.