ualbertalib/jupiter

Problem Performance (very busy database)

pgwillia opened this issue · 12 comments

Describe the bug
Neil observed a very busy database via Jupiter starting April 17th, it is intermittent and ongoing. Skylight observes the same trend https://www.skylight.io/app/applications/Q2pLKz0AwRat/recent/1d/endpoints which gives us some things to look into from our end. Beau and I deployed Jupiter 2.3.5 on April 11th with a new version of pg-client bumped from 9 to 12 is the other thing that might be relevant. Coincident with a user depositing their journal articles (>100) one at a time.

We had trouble this afternoon, with the primary database server for Jupiter. Nagios clearly reports the database load average was high, ~12:30 - 14:30 (and other incidents today). Netdata on the database server reports a larger than normal number of database sessions open (60), and a lot of queries. I went grepping through the incoming web queries on the HAProxy front-end, but found just a handful of MSNBot IP addresses, each clocking up just a few low-hundreds of queries each.
(I reached a dead-end -- it seems that normal bot activity makes the DB go crazy, so now I'm throwing around speculation!)

image
image

image
image

This problem occurred repeatedly over Christmas, 2023, making it a little less cheerful.
Bump!

Occurrence Xmas '23 + Jan. '24: https://helpdesk.library.ualberta.ca/otrs/index.pl?Action=AgentTicketZoom;TicketID=68358

In particular, the 24 hour period highlighted in this graph seems to show the PostgreSQL database reading 10 million rows-per-second, on average, for Jupiter. Also troubling, Jupiter is regularly averaging reading "several" millions of rows-per-second ... while DMP does not, although DMP is not generally subject to the abusive level of web requests seen sometimes with Jupiter. (This might be an anomaly with metrics... or maybe it's something to worry about.)
image

I have developed a dashboard for our PostgreSQL server in Grafana:
http://ayr:3000/d/eed44f20-d15c-48d6-b7ce-2e1bf5ad0ef5/postgresql?orgId=1&from=now-12h&to=now&refresh=5s
Don't forget to check out Netdata for the PG server, especially the "PostgreSQL/throughput" section: http://pg-db-prd-primary-1:19999

By way of validation...
I built another visualization panel into the Grafana dashboard for PostgreSQL, in Prod (link above).
The top image is the new one. It's based on a different metric, but it follows in time with the same values, implying millions of rows of data are regularly being returned, and when application load is heavy, it's tens-of-millions of rows. Baffling!
image

And another, again in the same time scale, tracking up & down in time with the other 2 graphs. This graphs "Rate of Transactions". 200 transactions-per-second seems really high (?) when application load is heavy?
image

I may have way to generate this scale of DB traffic in Staging. First off, know that we're unwilling to dedicate a lot of disk space (in Staging) to retaining Prometheus' data for more than the default 14 days, so you can't graph backwards in Grafana for longer than a fortnight.
So, in Staging, I built a similar dashboard in Grafana for PostgreSQL pair "march/mansfield"
Below is a snapshot of what I found, for the last 14 days. It first shows no activity at all, then yesterday at noon, it started generating ~2 million rows returned per second, and kept that up until 4am.
This was definitely Omar running thesis-ingestion scripts, testing a completely different PMPY problem.
image

What are the biggest tables in the DB? How big are they? (In Prod)

 table_schema |           table_name           | total_size | data_size | external_size 
--------------+--------------------------------+------------+-----------+---------------
 public       | versions                       | 872 MB     | 450 MB    | 422 MB
 public       | active_storage_blobs           | 135 MB     | 99 MB     | 36 MB
 public       | items                          | 101 MB     | 84 MB     | 17 MB
 public       | theses                         | 97 MB      | 69 MB     | 28 MB
 ... too small, don't care, snipped...

(Oh oh - feels like there's not enough data.)
OK, how many rows in those biggest tables?

jupiter2prod=# select count(*) from versions;
 count  
--------
 140756

jupiter2prod=# select count(*) from active_storage_blobs;
 count  
--------
 387522
 
jupiter2prod=# select count(*) from items;
 count 
-------
 44671

jupiter2prod=# select count(*) from theses;
 count 
-------
 34452

This isn't right! How can the graph show tens-of-millions of row operations, when our largest table is only 390k rows?


I think I get it!

I enabled database "query-logging" for 2 hours in Staging while Omar was running a PMPY test that took >24 hours and ingested millions of things. The results found that a query of the table active_storage_attachments was using the dreaded full-table-scan query plan, and ran ~7000 times in 2 hours. And that table has ~350,000 rows. Multiply this times 2000 IP addresses each streaming queries at us, and I suspect that's how we get to reading tens-of-millions of rows per second in Prod.

The database is small enough to fit in memory, so our hardware is making up for our lack of query optimization by returning results quickly from memory, but this doesn't hold up under load!

Look at any query using tables active_storage_blobs & active_storage_attachments, and maybe versions, too!

See also, TicketID=68458, Production was "attacked" again this weekend.

I turned on slow-query-logs in Production, capturing any SQL query that takes longer than 1 second to complete. This will be just the 1% outliers, as the vast majority of these queries are completing in < 250ms.
See the log, for the masochistic: slowQueries_2024_Jan30.log (Note: includes all queries/all DB's, so there might be some DMP queries here, too)

This mirrors what I saw in Staging:

jupiter2prod=# explain SELECT "active_storage_attachments".* FROM "active_storage_attachments" WHERE "active_storage_attachments"."record_type" = 'Item' AND "active_storage_attachments"."name" = 'files' AND "active_storage_attachments"."record_id" = '0625726d-cc85-40cd-8c54-7e7e65d06df6';
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..9567.77 rows=1 width=90)
   Workers Planned: 2
   ->  Parallel Seq Scan on active_storage_attachments  (cost=0.00..8567.67 rows=1 width=90)
         Filter: (((record_type)::text = 'Item'::text) AND ((name)::text = 'files'::text) AND (record_id = '0625726d-cc85-40cd-8c54-7e7e65d06df6'::uuid))
(4 rows)

That Parallel Seq Scan is the "full table scan" that needs query optimization -- big table!

jupiter2prod=# select count(*) from active_storage_attachments;
 count  
--------
 351470

Looking at the structure of the active_storage_attachments table as of commit 5f87c2f. There is no index on the record_id column (nor the other column in the where clause). Based on the slow query report and analysis above, one of the following might help:

jupiter_development=# \d active_storage_attachments
                                           Table "public.active_storage_attachments"
Column Type Collation Nullable Default
id bigint not null nextval('active_storage_attachments_id_seq'::regclass)
name character varying
created_at timestamp(6) without time zone
record_type character varying
fileset_uuid uuid
record_id uuid
blob_id uuid not null
Indexes:
    "active_storage_attachments_pkey" PRIMARY KEY, btree (id)
    "index_active_storage_attachments_on_blob_id" btree (blob_id)
Foreign-key constraints:
    "fk_rails_c3b3935057" FOREIGN KEY (blob_id) REFERENCES active_storage_blobs(id)
Referenced by:
    TABLE "digitization_books" CONSTRAINT "fk_rails_457c8fd1ed" FOREIGN KEY (logo_id) REFERENCES active_storage_attachments(id)

A quick, unsystematic test.

In a local dev instance, I bulk ingested ~90k items and tested without and with an index on the record_id in the active_storage_attachments table. Testing was done on a single-user system with an SSD disk and enough free RAM to fit the DB tables response time will be noticeably quicker than on a multiuser, loaded, prod DB server.

Results:

  • roughly 250x increase in response time according to explain analyze (though the actual time of one single query is not noticeable in this test environment but should be on loaded prod server) -- see below for details
  • query response time is cumulative: multiple queries to the active_storage_attachements table via the unindexed record_id are used to populate a item/thesis view or search results
  • there might be more efficient means to work with attachments in Rails 7 or N+1 queries adding to the response time.

Next steps: passing to others as my sprint is coming to an end

Baseline without an index:

jupiter_development=# explain analyze SELECT "active_storage_attachments".* FROM "active_storage_attachments" WHERE "active_storage_attachments"."record_type" = 'Item' AND "active_storage_attachments"."name" = 'files' AND "active_storage_attachments"."record_id" = '0625726d-cc85-40cd-8c54-7e7e65d06df6';
                                                                     QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on active_storage_attachments  (cost=0.00..2866.89 rows=1 width=76) (actual time=10.030..10.030 rows=0 loops=1)
   Filter: (((record_type)::text = 'Item'::text) AND ((name)::text = 'files'::text) AND (record_id = '0625726d-cc85-40cd-8c54-7e7e65d06df6'::uuid))
   Rows Removed by Filter: 91297
 Planning Time: 0.408 ms
 Execution Time: 10.040 ms
(5 rows)

Create an index:

jupiter_development=# create index if not exists record_id_idx on active_storage_attachments(record_id);
CREATE INDEX
jupiter_development=# analyze
;
ANALYZE
jupiter_development=# explain analyze SELECT "active_storage_attachments".* FROM "active_storage_attachments" WHERE "active_storage_attachments"."record_type" = 'Item' AND "active_storage_attachments"."name" = 'files' AND "active_storage_attachments"."record_id" = '0625726d-cc85-40cd-8c54-7e7e65d06df6';
                                                                QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using record_id_idx on active_storage_attachments  (cost=0.42..8.44 rows=1 width=76) (actual time=0.024..0.024 rows=0 loops=1)
   Index Cond: (record_id = '0625726d-cc85-40cd-8c54-7e7e65d06df6'::uuid)
   Filter: (((record_type)::text = 'Item'::text) AND ((name)::text = 'files'::text))
 Planning Time: 0.554 ms
 Execution Time: 0.040 ms
(5 rows)

jupiter_development=# select count(*) from active_storage_attachments;
 count
-------
 91535
(1 row)

Let's drop the index to see if the initial query was slowed by reading the table from disk into memory and the second query benefited. The result: the query was similar to the first without the index

jupiter_development=# drop index record_id_idx;
DROP INDEX
jupiter_development=# explain analyze SELECT "active_storage_attachments".* FROM "active_storage_attachments" WHERE "active_storage_attachments"."record_type" = 'Item' AND "active_storage_attachments"."name" = 'files' AND "active_storage_attachments"."record_id" = '0625726d-cc85-40cd-8c54-7e7e65d06df6';
                                                                     QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on active_storage_attachments  (cost=0.00..2920.57 rows=1 width=76) (actual time=11.285..11.285 rows=0 loops=1)
   Filter: (((record_type)::text = 'Item'::text) AND ((name)::text = 'files'::text) AND (record_id = '0625726d-cc85-40cd-8c54-7e7e65d06df6'::uuid))
   Rows Removed by Filter: 92938
 Planning Time: 0.131 ms
 Execution Time: 11.296 ms
(5 rows)

We suffered another event late last week, which drove ERA's query response time up over 8 seconds - so pretty serious operational impact. See: TicketID=69033 for details.

  • Timeframe was ~17:00 Thursday thru 17:00 Friday (April 19, 2024).
  • Resolution was to block hundreds of Class-B IPv4 networks in the local firewall; a slim majority of which belonged to AWS.
  • Again, no significant errors were reported, just query slowness, high load averages, performance madness on the DB.

Pretty fun graphs! 15-minute load averages > 10 per CPU, sustained:
image

More on this event: this graphic of the # of row operations demonstrates that the DB system averaged ~20M row-operations/s for the whole day while I didn't grok it & ineffectually blocked a few individual IP's. It even averaged 25M rows/s for a time, and touched > 30M. This suggests the overall attack was "about double" the size we suffered above, Jan 18th of this year, compare with above:
image

This issue was resolved by the efforts of @ConnorSheremeta & @jefferya culminating in the 2.9.1 release. This was installed in Production on May 27, 2024, and has eliminated operational issues with slow response-times in Production when queries reach above 6Hz.
Thanks, everyone!