[Bug]: `0.19.4-beta.6` performance issues
dessalines opened this issue ยท 13 comments
Requirements
- Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
- Did you check to see if this issue already exists?
- Is this only a single bug? Do not put multiple bugs in one issue.
- Do you agree to follow the rules in our Code of Conduct?
- Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.
Summary
We recently upgraded lemmy.ml to 0.19.4-beta.6
, as well as upgraded to postgres 16, and some things are going much slower. I'm not sure of the culprit, but here are a few slow ones.
comment insert:
duration: 35229.395 ms execute s86834: INSERT INTO "comment" ("creator_id", "post_id", "content", "removed", "published", "updated", "deleted", "ap_id", "local", "distinguished", "language_id") VALUES ($1, $2, $3, DEFAULT, $4, DEFAULT, $5, $6, $7, DEFAULT, DEFAULT) ON CONFLICT ("ap_id") WHERE (coalesce("comment"."updated", "comment"."published") < $8) DO UPDATE SET "creator_id" = $9, "post_id" = $10, "content" = $11, "published" = $12, "deleted" = $13, "ap_id" = $14, "local" = $15 RETURNING "comment"."id", "comment"."creator_id", "comment"."post_id", "comment"."content", "comment"."removed", "comment"."published", "comment"."updated", "comment"."deleted", "comment"."ap_id", "comment"."local", "comment"."path", "comment"."distinguished", "comment"."language_id"
person insert:
duration: 41885.124 ms execute s84589: INSERT INTO "person" ("name", "public_key", "instance_id", "display_name", "avatar", "banned", "published", "updated", "actor_id", "bio", "local", "private_key", "last_refreshed_at", "banner", "deleted", "inbox_url", "shared_inbox_url", "matrix_user_id", "bot_account", "ban_expires") VALUES ($1, $2, $3, DEFAULT, DEFAULT, DEFAULT, $4, DEFAULT, $5, DEFAULT, $6, DEFAULT, $7, DEFAULT, $8, $9, $10, DEFAULT, $11, DEFAULT) ON CONFLICT ("actor_id") DO UPDATE SET "name" = $12, "public_key" = $13, "instance_id" = $14, "published" = $15, "actor_id" = $16, "local" = $17, "last_refreshed_at" = $18, "deleted" = $19, "inbox_url" = $20, "shared_inbox_url" = $21, "bot_account" = $22 RETURNING "person"."id", "person"."name", "person"."display_name", "person"."avatar", "person"."banned", "person"."published", "person"."updated", "person"."actor_id", "person"."bio", "person"."local", "person"."private_key", "person"."public_key", "person"."last_refreshed_at", "person"."banner", "person"."deleted", "person"."inbox_url", "person"."shared_inbox_url", "person"."matrix_user_id", "person"."bot_account", "person"."ban_expires", "person"."instance_id"
The standard post query is also slower.
duration: 3515.528 ms execute s51460: postgres-1
-- PostQuery::list
postgres-1 | SELECT * FROM (SELECT "post"."id", "post"."name,....
Also getting a lot of these sharelocks:
Process 87 waits for ShareLock on transaction 340080; blocked by process 53.
I'm also seeing the lemmy-federate
process having a very high CPU usage, >400%
Steps to Reproduce
See above
Technical Details
Same
Version
0.19.4-beta.6
Lemmy Instance URL
main
K here's an explain analyze. The "Trigger Name": "update_statement" seems to be the culprit.
QUERY PLAN
----------------------------------------------------------
[ +
{ +
"Plan": { +
"Node Type": "ModifyTable", +
"Operation": "Insert", +
"Parallel Aware": false, +
"Async Capable": false, +
"Relation Name": "comment", +
"Alias": "comment", +
"Startup Cost": 0.00, +
"Total Cost": 0.01, +
"Plan Rows": 0, +
"Plan Width": 0, +
"Actual Startup Time": 1.711, +
"Actual Total Time": 1.711, +
"Actual Rows": 0, +
"Actual Loops": 1, +
"Conflict Resolution": "UPDATE", +
"Conflict Arbiter Indexes": ["idx_comment_ap_id"],+
"Tuples Inserted": 0, +
"Conflicting Tuples": 1, +
"Shared Hit Blocks": 71, +
"Shared Read Blocks": 63, +
"Shared Dirtied Blocks": 15, +
"Shared Written Blocks": 0, +
"Local Hit Blocks": 0, +
"Local Read Blocks": 0, +
"Local Dirtied Blocks": 0, +
"Local Written Blocks": 0, +
"Temp Read Blocks": 0, +
"Temp Written Blocks": 0, +
"Plans": [ +
{ +
"Node Type": "Result", +
"Parent Relationship": "Outer", +
"Parallel Aware": false, +
"Async Capable": false, +
"Startup Cost": 0.00, +
"Total Cost": 0.01, +
"Plan Rows": 1, +
"Plan Width": 616, +
"Actual Startup Time": 0.152, +
"Actual Total Time": 0.153, +
"Actual Rows": 1, +
"Actual Loops": 1, +
"Shared Hit Blocks": 8, +
"Shared Read Blocks": 5, +
"Shared Dirtied Blocks": 1, +
"Shared Written Blocks": 0, +
"Local Hit Blocks": 0, +
"Local Read Blocks": 0, +
"Local Dirtied Blocks": 0, +
"Local Written Blocks": 0, +
"Temp Read Blocks": 0, +
"Temp Written Blocks": 0 +
} +
] +
}, +
"Planning": { +
"Shared Hit Blocks": 126, +
"Shared Read Blocks": 27, +
"Shared Dirtied Blocks": 0, +
"Shared Written Blocks": 0, +
"Local Hit Blocks": 0, +
"Local Read Blocks": 0, +
"Local Dirtied Blocks": 0, +
"Local Written Blocks": 0, +
"Temp Read Blocks": 0, +
"Temp Written Blocks": 0 +
}, +
"Planning Time": 3.052, +
"Triggers": [ +
{ +
"Trigger Name": "aggregates", +
"Relation": "comment", +
"Time": 0.410, +
"Calls": 1 +
}, +
{ +
"Trigger Name": "insert_statement", +
"Relation": "comment", +
"Time": 3.627, +
"Calls": 1 +
}, +
{ +
"Trigger Name": "update_statement", +
"Relation": "comment", +
"Time": 15363.898, +
"Calls": 1 +
} +
], +
"Execution Time": 15369.733 +
} +
]
(1 row)
can you post the query you ran?
Here's the slow trigger
The query:
explain (FORMAT JSON, analyze, buffers) INSERT INTO comment (creator_id, post_id, content, removed, published, updated, deleted, ap_id, local, distinguished, language_id) VALUES (701945, '15075325', 'does anyone have a brief summary of why it caused concern? ', DEFAULT, '2024-04-30 14:55:55.455344+00', DEFAULT, 'f', 'https://lemmy.world/comment/9760415', 'f', 'f', '37') ON CONFLICT (ap_id) WHERE (coalesce(comment.updated, comment.published) < '2024-04-30 14:55:55.455344+00') DO UPDATE SET creator_id = 701945, post_id = 15075325, content = 'does anyone have a brief summary of why it caused concern? ', published = '2024-04-30 14:55:55.455344+00', deleted = 'f', ap_id = 'https://lemmy.world/comment/9760415', local = 'f', distinguished = 'f', language_id = '37';
I'm downgrading lemmy.ml for now tho, we need to fix this locally and spend some more time on this.
@dullbananas lemmy.ml is back on 0.19.3
, so no rush on checking this one out. lmk if you need me to test with lemmy.ml's prod database on my dev machine.
@dessalines Could you check the execution time of each statement executed by the trigger? It might be possible by using auto_explain with log_triggers and log_analyze enabled.
Have you tried to reproduce this with the db_perf crate?
@dullbananas https://paste.centos.org/view/e746389e
There's a few comment inserts that take >30s, with trigger logging enabled, but it doesn't seem to show what's specifically slow.
Also noteworthy that this affects not just comment inserts, but every insert, person, post, etc.
{ + "Trigger Name": "insert_statement", + "Relation": "comment", + "Time": 3.627, + "Calls": 1 + }, + { + "Trigger Name": "update_statement", + "Relation": "comment", + "Time": 15363.898, + "Calls": 1 + } +
Looks like this is a problem specifically for update or upsert. insert_statement
uses the same function definition as update_statement
, except select_old_rows
is replaced with (SELECT * FROM select_new_rows WHERE FALSE)
.
Edit: this is probably wrong because the insert_statement trigger would run with 0 rows as input in this case
The duration of the transaction done when inserting a comment is probably a big contributing factor. ShareLock is used to wait for another transaction to finish.
I think this is the main problem: in the new triggers, the update to site_aggregates is not the last statement, but it probably was previously the last trigger to run because "site" is far down in alphabetical order. This would extend the duration from the start of the site_aggregates update to the end of the whole transaction, which is the period in which concurrent writers are blocked. Avoiding this problem is far more crucial for site_aggregates than for other tables because all local insertions update the same row in site_aggregates.
So just moving that statement to the bottom of the trigger would fix it?
We already have some site_aggregates and community_aggregates updates in scheduled tasks run every hour, so to me a lot of these non-crucial "site stat" type ones it makes sense to move out of triggers.
So just moving that statement to the bottom of the trigger would fix it?
Yes, and that's one of the things done in #4696
Closing this, as #4696 is merged, and it fixed the issues ๐ค . Can re-open if more pop up.