LemmyNet/lemmy

[Bug]: `0.19.4-beta.6` performance issues

dessalines opened this issue ยท 13 comments

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Do you agree to follow the rules in our Code of Conduct?
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

We recently upgraded lemmy.ml to 0.19.4-beta.6, as well as upgraded to postgres 16, and some things are going much slower. I'm not sure of the culprit, but here are a few slow ones.

comment insert:

duration: 35229.395 ms  execute s86834: INSERT INTO "comment" ("creator_id", "post_id", "content", "removed", "published", "updated", "deleted", "ap_id", "local", "distinguished", "language_id") VALUES ($1, $2, $3, DEFAULT, $4, DEFAULT, $5, $6, $7, DEFAULT, DEFAULT) ON CONFLICT ("ap_id") WHERE (coalesce("comment"."updated", "comment"."published") < $8) DO UPDATE SET "creator_id" = $9, "post_id" = $10, "content" = $11, "published" = $12, "deleted" = $13, "ap_id" = $14, "local" = $15 RETURNING "comment"."id", "comment"."creator_id", "comment"."post_id", "comment"."content", "comment"."removed", "comment"."published", "comment"."updated", "comment"."deleted", "comment"."ap_id", "comment"."local", "comment"."path", "comment"."distinguished", "comment"."language_id"

person insert:

duration: 41885.124 ms  execute s84589: INSERT INTO "person" ("name", "public_key", "instance_id", "display_name", "avatar", "banned", "published", "updated", "actor_id", "bio", "local", "private_key", "last_refreshed_at", "banner", "deleted", "inbox_url", "shared_inbox_url", "matrix_user_id", "bot_account", "ban_expires") VALUES ($1, $2, $3, DEFAULT, DEFAULT, DEFAULT, $4, DEFAULT, $5, DEFAULT, $6, DEFAULT, $7, DEFAULT, $8, $9, $10, DEFAULT, $11, DEFAULT) ON CONFLICT ("actor_id") DO UPDATE SET "name" = $12, "public_key" = $13, "instance_id" = $14, "published" = $15, "actor_id" = $16, "local" = $17, "last_refreshed_at" = $18, "deleted" = $19, "inbox_url" = $20, "shared_inbox_url" = $21, "bot_account" = $22 RETURNING "person"."id", "person"."name", "person"."display_name", "person"."avatar", "person"."banned", "person"."published", "person"."updated", "person"."actor_id", "person"."bio", "person"."local", "person"."private_key", "person"."public_key", "person"."last_refreshed_at", "person"."banner", "person"."deleted", "person"."inbox_url", "person"."shared_inbox_url", "person"."matrix_user_id", "person"."bot_account", "person"."ban_expires", "person"."instance_id"

The standard post query is also slower.

duration: 3515.528 ms  execute s51460:  postgres-1 
-- PostQuery::list
postgres-1  |   SELECT * FROM (SELECT "post"."id", "post"."name,....

Also getting a lot of these sharelocks:

Process 87 waits for ShareLock on transaction 340080; blocked by process 53.

I'm also seeing the lemmy-federate process having a very high CPU usage, >400%

Steps to Reproduce

See above

Technical Details

Same

Version

0.19.4-beta.6

Lemmy Instance URL

main

K here's an explain analyze. The "Trigger Name": "update_statement" seems to be the culprit.

cc @dullbananas @phiresky

QUERY PLAN                        
----------------------------------------------------------
 [                                                       +
   {                                                     +
     "Plan": {                                           +
       "Node Type": "ModifyTable",                       +
       "Operation": "Insert",                            +
       "Parallel Aware": false,                          +
       "Async Capable": false,                           +
       "Relation Name": "comment",                       +
       "Alias": "comment",                               +
       "Startup Cost": 0.00,                             +
       "Total Cost": 0.01,                               +
       "Plan Rows": 0,                                   +
       "Plan Width": 0,                                  +
       "Actual Startup Time": 1.711,                     +
       "Actual Total Time": 1.711,                       +
       "Actual Rows": 0,                                 +
       "Actual Loops": 1,                                +
       "Conflict Resolution": "UPDATE",                  +
       "Conflict Arbiter Indexes": ["idx_comment_ap_id"],+
       "Tuples Inserted": 0,                             +
       "Conflicting Tuples": 1,                          +
       "Shared Hit Blocks": 71,                          +
       "Shared Read Blocks": 63,                         +
       "Shared Dirtied Blocks": 15,                      +
       "Shared Written Blocks": 0,                       +
       "Local Hit Blocks": 0,                            +
       "Local Read Blocks": 0,                           +
       "Local Dirtied Blocks": 0,                        +
       "Local Written Blocks": 0,                        +
       "Temp Read Blocks": 0,                            +
       "Temp Written Blocks": 0,                         +
       "Plans": [                                        +
         {                                               +
           "Node Type": "Result",                        +
           "Parent Relationship": "Outer",               +
           "Parallel Aware": false,                      +
           "Async Capable": false,                       +
           "Startup Cost": 0.00,                         +
           "Total Cost": 0.01,                           +
           "Plan Rows": 1,                               +
           "Plan Width": 616,                            +
           "Actual Startup Time": 0.152,                 +
           "Actual Total Time": 0.153,                   +
           "Actual Rows": 1,                             +
           "Actual Loops": 1,                            +
           "Shared Hit Blocks": 8,                       +
           "Shared Read Blocks": 5,                      +
           "Shared Dirtied Blocks": 1,                   +
           "Shared Written Blocks": 0,                   +
           "Local Hit Blocks": 0,                        +
           "Local Read Blocks": 0,                       +
           "Local Dirtied Blocks": 0,                    +
           "Local Written Blocks": 0,                    +
           "Temp Read Blocks": 0,                        +
           "Temp Written Blocks": 0                      +
         }                                               +
       ]                                                 +
     },                                                  +
     "Planning": {                                       +
       "Shared Hit Blocks": 126,                         +
       "Shared Read Blocks": 27,                         +
       "Shared Dirtied Blocks": 0,                       +
       "Shared Written Blocks": 0,                       +
       "Local Hit Blocks": 0,                            +
       "Local Read Blocks": 0,                           +
       "Local Dirtied Blocks": 0,                        +
       "Local Written Blocks": 0,                        +
       "Temp Read Blocks": 0,                            +
       "Temp Written Blocks": 0                          +
     },                                                  +
     "Planning Time": 3.052,                             +
     "Triggers": [                                       +
       {                                                 +
         "Trigger Name": "aggregates",                   +
         "Relation": "comment",                          +
         "Time": 0.410,                                  +
         "Calls": 1                                      +
       },                                                +
       {                                                 +
         "Trigger Name": "insert_statement",             +
         "Relation": "comment",                          +
         "Time": 3.627,                                  +
         "Calls": 1                                      +
       },                                                +
       {                                                 +
         "Trigger Name": "update_statement",             +
         "Relation": "comment",                          +
         "Time": 15363.898,                              +
         "Calls": 1                                      +
       }                                                 +
     ],                                                  +
     "Execution Time": 15369.733                         +
   }                                                     +
 ]
(1 row)

can you post the query you ran?

Here's the slow trigger

CALL r.create_triggers ('comment', $$

The query:

explain (FORMAT JSON, analyze, buffers) INSERT INTO comment (creator_id, post_id, content, removed, published, updated, deleted, ap_id, local, distinguished, language_id) VALUES (701945, '15075325', 'does anyone have a brief summary of why it caused concern? ', DEFAULT, '2024-04-30 14:55:55.455344+00', DEFAULT, 'f', 'https://lemmy.world/comment/9760415', 'f', 'f', '37') ON CONFLICT (ap_id) WHERE (coalesce(comment.updated, comment.published) < '2024-04-30 14:55:55.455344+00') DO UPDATE SET creator_id = 701945, post_id = 15075325, content = 'does anyone have a brief summary of why it caused concern? ', published = '2024-04-30 14:55:55.455344+00', deleted = 'f', ap_id = 'https://lemmy.world/comment/9760415', local = 'f', distinguished = 'f', language_id = '37';

I'm downgrading lemmy.ml for now tho, we need to fix this locally and spend some more time on this.

@dullbananas lemmy.ml is back on 0.19.3, so no rush on checking this one out. lmk if you need me to test with lemmy.ml's prod database on my dev machine.

@dessalines Could you check the execution time of each statement executed by the trigger? It might be possible by using auto_explain with log_triggers and log_analyze enabled.

Have you tried to reproduce this with the db_perf crate?

@dullbananas https://paste.centos.org/view/e746389e

There's a few comment inserts that take >30s, with trigger logging enabled, but it doesn't seem to show what's specifically slow.

Also noteworthy that this affects not just comment inserts, but every insert, person, post, etc.

       {                                                 +
         "Trigger Name": "insert_statement",             +
         "Relation": "comment",                          +
         "Time": 3.627,                                  +
         "Calls": 1                                      +
       },                                                +
       {                                                 +
         "Trigger Name": "update_statement",             +
         "Relation": "comment",                          +
         "Time": 15363.898,                              +
         "Calls": 1                                      +
       }                                                 +

Looks like this is a problem specifically for update or upsert. insert_statement uses the same function definition as update_statement, except select_old_rows is replaced with (SELECT * FROM select_new_rows WHERE FALSE).

Edit: this is probably wrong because the insert_statement trigger would run with 0 rows as input in this case

The duration of the transaction done when inserting a comment is probably a big contributing factor. ShareLock is used to wait for another transaction to finish.

I think this is the main problem: in the new triggers, the update to site_aggregates is not the last statement, but it probably was previously the last trigger to run because "site" is far down in alphabetical order. This would extend the duration from the start of the site_aggregates update to the end of the whole transaction, which is the period in which concurrent writers are blocked. Avoiding this problem is far more crucial for site_aggregates than for other tables because all local insertions update the same row in site_aggregates.

So just moving that statement to the bottom of the trigger would fix it?

We already have some site_aggregates and community_aggregates updates in scheduled tasks run every hour, so to me a lot of these non-crucial "site stat" type ones it makes sense to move out of triggers.

So just moving that statement to the bottom of the trigger would fix it?

Yes, and that's one of the things done in #4696

Closing this, as #4696 is merged, and it fixed the issues ๐Ÿคž . Can re-open if more pop up.