PublicMapping/districtbuilder

Investigate performance issue on production

KlaasH commented

Here's the course of my investigation so far:

The failures are caused by the database becoming unresponsive

I initially expected to see the old "ECS task has a memory spike and dies" pattern, but in fact there was no memory spike, and the ECS task logs are full of "Health check failed" messages that identify the database as the unhealthy component. When that goes on long enough, the task gets killed off and another one gets started. This brings the app down for at least 10 minutes while the new task initializes, and sometimes longer. I didn't look closely into what causes the longer outages, but I think it might be that new tasks initialize more slowly when the database is still in bad shape. Or possibly we're getting cases where the new task never reaches "healthy" status and gets cycled again.

The disk was almost full

The free disk space was hovering around 6 GiB, out of 210 GiB total. That doesn't seem like enough to provide Postgres with the scratch space it needs for things like large transactions and table maintenance. It doesn't seem like we ever hit "disk full", but it seems possible that the database engine has been having to work harder due to the lack of space for writing temporary buffers.

I turned on storage auto-scaling, and it scaled up to 231 GiB shortly thereafter. I set the max to 500 GiB because we don't want to just let it grow forever. It seems like over the past two months (which is as far back as the storage logs go) we've added about 10 GiB per month, so it should be a while before we hit that limit, and by then it might be time to talk about options for reducing how much we store in the projects table (which accounts for almost all of the space in use).

We had a conversation on Slack brainstorming ways to reduce the size of the projects table, such as moving projects out and archiving them to S3 after a certain amount of time. But, as I said there, I think it's premature to put significant effort into saving storage space, since storage is still cheap (tens of dollars a month) compared to what it would cost to design and implement any such system.
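
For reference, a generic query along these lines (nothing DistrictBuilder-specific) shows which relations are taking up the space:

-- Largest tables in the database, including their indexes and TOAST data
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;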

We've been paying a lot for provisioned IOPS

The storage for the database was set up using Provisioned IOPS (io1). As Derek noted on the thread, that was done as a stop-gap to deal with an earlier period of performance problems. But it's quite expensive (10 cents per provisioned IOPS, plus it makes each GiB of storage more expensive: 12.5¢ instead of 11.5¢), and it turns out 400 GiB of gp2 storage would get the same level of baseline IOPS (1200) for significantly less money overall. Even better, gp3 storage has a baseline of 3000 IOPS for any size volume, so switching to that would actually increase our IO capacity while decreasing the cost of the same amount of storage by about 80%.

So I made that change. The database is now running with 231 GiB of gp3 storage (changes between SSD-backed storage types get applied with no downtime).
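
To put rough numbers on the comparison, using the per-unit prices above and assuming gp3 is billed at roughly the same per-GiB rate as gp2 (an approximation; RDS pricing varies by region and changes over time):

  io1: 231 GiB × $0.125/GiB + 1,200 provisioned IOPS × $0.10/IOPS ≈ $149/month
  gp3: 231 GiB × ~$0.115/GiB                                      ≈ $27/month

That's roughly the 80% reduction mentioned above; most of the savings comes from dropping the provisioned-IOPS charge.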

The immediate cause of the database problems is a slow query

Having enough storage space is definitely better than not having enough, but as it turned out, the database went down again twice within 12 hours of the storage increase. The picture was particularly stark this morning: the graphs looked totally fine all night, then at 8:41 ReadIOPS, ReadThroughput, and DiskQueueDepth shot up and stayed high. The app went down 30 minutes later (I think the lag is that long because we have the health-check failure timeout set high). Looking at the long-running queries in the database, I found that the query it was choking on was the one from findAllPublishedProjectsPaginated, which I believe is the same one that caused trouble during the previous incident, when the provisioned IOPS were set up. What's weird is that we already added an index for it, in PR #969.
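
For anyone following along, the long-running queries are visible in pg_stat_activity; something like this (generic Postgres, not specific to our setup) surfaces them:

-- Non-idle queries that have been running for more than a minute, longest first
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 120) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '1 minute'
ORDER BY runtime DESC;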

Next steps

That's as far as I've gotten. I haven't analyzed the query yet to see if I can figure out why it's not using the index. One possibility that occurred to me is that the index could have gotten messed up by the lack of storage space, but I don't know if that actually makes sense.

The other interesting clue that I haven't pursued yet is that this query seemingly went from being fine to being disastrously slow all of a sudden, on March 29.
[Screenshot of a dashboard graph showing database IO spikes beginning on 3/29]

There are no database IO spikes before that (well, there are a few if you go back to last September or earlier, but only occasionally); then suddenly they started happening all the time. My first thought was that this coincided with launching the new instance type, but that actually happened earlier, on March 13. The only interesting thing that happened just before 3/29 was the four "Delete user account" issues (#1300, #1301, #1303, #1306). I don't have any hypothesis for how manually deleting some users could have broken the "published projects" query, but the timing (no occurrences before 3/29, multiple per day since) is too suggestive to ignore.

So, to put it in list form (though the first item is pretty open-ended), the next steps are:

  • Diagnose and fix the slow query
  • Update the tfvars and, if necessary, the Terraform config to reflect the changes that have been made to the database manually in the AWS console:
    • gp3 storage type
    • Storage auto-scaling
    • (any others that get changed related to the slow query, if applicable)
KlaasH commented

Two further notes:

  • I made a nice dashboard for issue #1294 but deleted it from the actual console, because it costs money to leave dashboards around. I tried to attach the JSON to a comment there, but evidently I messed up the file. The screenshot above is from the new one I made, which is similar but not identical (this one has more database metrics, for one thing). Here's the JSON for that: LoadTestingDashboard.json
  • It has been about 24 hours since the last "spike DB IO and die" event, which is longer than the recent norm. It might just be a fluke, but I think it's worth waiting to see if we might have caught a break and whatever was causing it got magically resolved (e.g. by a DB autovacuum that hadn't been able to run before but can now that there's disk space). So I'm inclined to delay the "diagnose and fix slow query" step until we're sure it's necessary (i.e. until we see another sustained DiskQueueDepth spike).
KlaasH commented

WHELP I jinxed it. Same pattern happened at 11:00am. It seems like the query did finish, after just over an hour. You can also see the effects of the higher IOPS available with gp3 storage, since ReadIOPS hovered around 2,000 and peaked at 2,500. But that didn't help with the health-check failures.

KlaasH commented

Ok, I isolated the query and did some investigation with EXPLAIN and EXPLAIN ANALYZE. Things remained a bit confusing, because the query plans were different in different databases—which is maybe not surprising between my local dev database (with almost no data) and production, but the plans were also different between production and the testing instance I spun up using a production snapshot.
But I was able to reach some conclusions:

  • The "published projects" query with no state filter was using the IDX_PUBLISHED_PROJECTS index and running very fast.
  • Adding a state filter (which shows up as e.g. AND "regionConfig"."region_code" = 'PA' in the query) caused the production database to stop using the index and do a sequential scan on the project table. That's the thing that takes an hour and saturates the database IO to the point where health checks start failing.
  • The snapshot I spun up had the same problem, but when I created a second copy of the exact same index, it started using that copy for the state-filter query and running it very quickly.
  • So the "somehow the index got corrupted" hypothesis seemed like a good one after all.

Armed with that knowledge, I ran REINDEX INDEX CONCURRENTLY "IDX_PUBLISHED_PROJECTS"; on the production database. It took about 45 minutes and caused a lot of read throughput, but it didn't send DiskQueueDepth too high and the site kept working fine while it was running (p99 latency was probably up a little, but not enough to stand out in the overall graph).
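
In case it's useful next time: Postgres 12+ reports REINDEX CONCURRENTLY progress in pg_stat_progress_create_index, so a rebuild like this can be watched from another psql session with something like:

-- Watch the phase and block/tuple progress of a running CREATE INDEX / REINDEX CONCURRENTLY
SELECT phase, blocks_done, blocks_total, tuples_done, tuples_total
FROM pg_stat_progress_create_index;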

I believe that fixed the problem. Running EXPLAIN on the query with the state filter after the REINDEX finished showed a query plan that used the index, and running EXPLAIN ANALYZE (which I hadn't been able to do before without immediately cancelling it, since it would cause the spike/crash) finished in milliseconds. I ran the REINDEX at 11:15am on 4/15, coincidentally about half an hour after the site had recovered from another crash, and there haven't been any more incidents in the 2 days since then.

I'm not sure what lesson to take from this. I don't know if the index was actually corrupted or if it was just skewed in some way that made it look unsuitable for the state-filter query (it can't have been completely broken, since it was still being used for the query without the state filter). And if it was corrupted, I don't know if that was caused by the user deletions, the limited disk space, or some other factor that's not on my radar.

Just for reference/illustration, here are some excerpts from the psql shell of the before and after query plans: production_explain_analyze_and_reindex_results.txt

@KlaasH thanks for the investigation and for reindexing. It seems like the reindex has helped with performance so far. I'm not sure what the direct cause of this performance issue was. I'll make a note, at least for myself, to run VACUUM ANALYZE and REINDEX after future deletions so we can eliminate that as a potential factor.

I recall you mentioned a Terraform change coming out of this? Did I hear that correctly? If so, what will the change cover?

KlaasH commented

I checked when the table had last been vacuumed and analyzed and it was within the last day or two, so that part seems fine. I would be super curious to find out if this happens again when we next delete users. We should be able to tell by running EXPLAIN on the query before and after doing user deletions. If it's still an all-index operation in the "after" plan, that would be evidence that the user deletion thing is a red herring. If it changes to requiring a sequential scan, that would be super interesting and worth investigating further. And yeah, in that case running the REINDEX INDEX CONCURRENTLY command again would make sense.
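
For reference, the check I mean is just the standard stats view (assuming the table is named project, as in the query plans):

-- Last manual and automatic vacuum/analyze times for the project table
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'project';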

Re the Terraform changes: I was going to add a variable called rds_max_allocated_storage and pass it through to max_allocated_storage in the database Terraform config, since that would make the config match what's on production. But it turns out we're not using aws_db_instance directly, we're using the Azavea Terraform module (https://github.com/azavea/terraform-aws-postgresql-rds), so adding that variable would require changing the module to pass it through. That doesn't seem worth the trouble, so I'm just going to update the allocated storage and storage type to match the console changes and call it good:

-rds_allocated_storage            = "210"
+rds_allocated_storage            = "231"
 rds_engine_version               = "13"
 rds_parameter_group_family       = "postgres13"
 rds_instance_type                = "db.t3.medium"
-rds_storage_type                 = "io1"
-rds_iops                         = "1200"
+rds_storage_type                 = "gp3"

@KlaasH can you open an issue against that module? I may be able to update it without a horrible amount of work, though I can't guarantee when

KlaasH commented

@KlaasH can you open an issue against that module?

I went to do so and discovered that there was an issue already: azavea/terraform-aws-postgresql-rds#40

KlaasH commented

Still feeling good about this. Here's a graph of the last 4 weeks; the situation since Saturday looks promisingly similar to how things looked before 3/29, when the chaos began:
[Screenshot of the dashboard showing the last 4 weeks of database and app metrics]

Also, I'm going to just copy the dashboard JSON inline here.

{
    "widgets": [
        {
            "height": 6,
            "width": 11,
            "y": 0,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [ "AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/albProductionApp/56a2ff0d448c17eb", { "yAxis": "right", "region": "us-east-1", "visible": false } ],
                    [ ".", "HTTPCode_ELB_5XX_Count", ".", ".", { "yAxis": "right", "region": "us-east-1", "visible": false } ],
                    [ ".", "TargetResponseTime", ".", ".", { "yAxis": "left", "stat": "Average", "region": "us-east-1" } ],
                    [ "...", { "yAxis": "left", "stat": "p99", "region": "us-east-1" } ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "us-east-1",
                "stat": "Sum",
                "period": 60
            }
        },
        {
            "height": 6,
            "width": 11,
            "y": 0,
            "x": 11,
            "type": "metric",
            "properties": {
                "metrics": [
                    [ "AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/albProductionApp/56a2ff0d448c17eb", { "yAxis": "left", "region": "us-east-1" } ],
                    [ ".", "HTTPCode_ELB_5XX_Count", ".", ".", { "yAxis": "left", "region": "us-east-1" } ],
                    [ ".", "TargetResponseTime", ".", ".", { "yAxis": "right", "stat": "Average", "region": "us-east-1", "visible": false } ],
                    [ "...", { "yAxis": "right", "stat": "p99", "region": "us-east-1", "visible": false } ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "us-east-1",
                "stat": "Sum",
                "period": 60
            }
        },
        {
            "height": 6,
            "width": 11,
            "y": 12,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [ "AWS/ECS", "MemoryUtilization", "ClusterName", "ecsProductionCluster", { "yAxis": "left", "region": "us-east-1" } ],
                    [ ".", "CPUUtilization", ".", ".", { "yAxis": "left", "region": "us-east-1" } ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "us-east-1",
                "stat": "Average",
                "period": 60,
                "title": "ECS CPU & Memory"
            }
        },
        {
            "height": 6,
            "width": 11,
            "y": 12,
            "x": 11,
            "type": "metric",
            "properties": {
                "metrics": [
                    [ "AWS/RDS", "FreeableMemory", "DBInstanceIdentifier", "districtbuilder-production", { "yAxis": "right" } ],
                    [ ".", "CPUUtilization", ".", "." ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "us-east-1",
                "stat": "Average",
                "period": 60,
                "title": "Database CPU & Memory"
            }
        },
        {
            "height": 6,
            "width": 11,
            "y": 6,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [ "AWS/RDS", "WriteIOPS", "DBInstanceIdentifier", "districtbuilder-production" ],
                    [ ".", "ReadIOPS", ".", "." ],
                    [ ".", "WriteLatency", ".", ".", { "yAxis": "right" } ],
                    [ ".", "ReadLatency", ".", ".", { "yAxis": "right" } ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "us-east-1",
                "stat": "Average",
                "period": 60
            }
        },
        {
            "height": 6,
            "width": 11,
            "y": 6,
            "x": 11,
            "type": "metric",
            "properties": {
                "metrics": [
                    [ "AWS/RDS", "DiskQueueDepth", "DBInstanceIdentifier", "districtbuilder-production", { "region": "us-east-1" } ],
                    [ ".", "ReadThroughput", ".", ".", { "yAxis": "right" } ],
                    [ ".", "WriteThroughput", ".", ".", { "yAxis": "right" } ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "us-east-1",
                "stat": "Average",
                "period": 60,
                "title": "DB DiskQueueDepth & Throughput"
            }
        },
        {
            "height": 6,
            "width": 6,
            "y": 18,
            "x": 0,
            "type": "metric",
            "properties": {
                "view": "timeSeries",
                "stacked": false,
                "metrics": [
                    [ "AWS/RDS", "FreeStorageSpace", "DBInstanceIdentifier", "districtbuilder-production" ]
                ],
                "region": "us-east-1"
            }
        }
    ]
}

I haven't deleted that yet, since it doesn't seem like the time to fully relax has arrived, but we can delete it whenever we decide it's time.

KlaasH commented

Still looking good. I'm going to close this.

@KlaasH I'm getting flashbacks reading through this writeup 😅

Is it possible the lack of disk space caused the automated REINDEX to fail, which caused the corrupted index? Maybe something can alert if the REINDEX fails.

KlaasH commented

@maurizi Ha, I forgot this was open-source and GitHub would be helpfully keeping you informed about what you're missing 😁

And wow, I missed that it was reindexing monthly. It must be that. The troubles started around the first of the month, and a failed REINDEX makes so much more sense than the index going bad from a few seemingly-unrelated data changes.
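
If we want something to alert on, one option (just a sketch, nothing is wired up yet) would be a periodic check for invalid indexes, since a failed CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY leaves the half-built index behind marked invalid:

-- Indexes left in an invalid state by a failed CONCURRENTLY build/rebuild
SELECT n.nspname AS schema_name, c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;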