mozilla/ichnaea

Use READ COMMITTED isolation during the StatRegion queries

jwhitlock opened this issue · 6 comments

The StatRegion task counts the number of stations per region, as displayed on https://location.services.mozilla.com/stats/regions. It runs one count query per station shard table, such as:

```sql
SELECT wifi_shard_9.region, count(*) AS count_1
FROM wifi_shard_9
WHERE wifi_shard_9.region IS NOT NULL
GROUP BY wifi_shard_9.region
```

These queries can take a long time, even with an index. At the default REPEATABLE READ isolation level, MySQL has to retain undo history for every transaction that commits while the query is running, in order to present a consistent snapshot. The change would be:

  • Use the READ COMMITTED isolation level when gathering the counts (see the sketch after this list).
  • Switch back to the default REPEATABLE READ isolation level to update the region counts tables.
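A minimal sketch of the per-connection isolation change, assuming a SQLAlchemy engine; the `count_regions` helper and the connection URL are illustrative, not the actual ichnaea code:

```python
from sqlalchemy import create_engine, text

# Illustrative engine URL; the real ichnaea settings differ.
engine = create_engine("mysql+pymysql://user:pass@localhost/location")

def count_regions(shard_table):
    # shard_table comes from a fixed list of station tables
    # (e.g. wifi_shard_0 .. wifi_shard_f), never from user input.
    query = text(
        f"SELECT region, COUNT(*) AS count_1 FROM {shard_table} "
        "WHERE region IS NOT NULL GROUP BY region"
    )
    # With READ COMMITTED, InnoDB does not have to hold undo history
    # for a consistent snapshot while the long count runs.
    with engine.connect().execution_options(
        isolation_level="READ COMMITTED"
    ) as conn:
        return dict(conn.execute(query).fetchall())

# Updating the region counts table then happens on a normal connection,
# at the engine's default REPEATABLE READ isolation level.
```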


Starting a new transaction inside the task with REPEATABLE READ isolation is not interacting well with the testing code, which wraps each test in a transaction so that any database writes can be rolled back. The REPEATABLE READ connection sees an empty database from the start of the test rather than the one set up with test rows.

I'm thinking through next steps, but I'm leaning toward faking the isolation level in tests.
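One possible shape for that fake, where every name (the import path, the `read_committed_session` helper, the task entry point) is hypothetical: patch the code that requests READ COMMITTED so the task reuses the test's own transaction-wrapped session and can still see the fixture rows.

```python
from unittest import mock

from ichnaea.data import tasks  # hypothetical import path

def test_stat_region(db_session):
    # Reuse the test's transaction-wrapped session so the task sees
    # the fixture rows, instead of opening a real READ COMMITTED
    # connection that would see an empty database.
    with mock.patch.object(
        tasks, "read_committed_session",  # hypothetical helper name
        return_value=db_session,
    ):
        tasks.stat_region()  # hypothetical entry point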

A bit more info: the 16(?) queries all run serially in a single transaction, so the history list has to track roughly 4 hours of updates.
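For reference, the history list length being discussed can be read from `SHOW ENGINE INNODB STATUS`; a small sketch, where the helper name is mine:

```python
import re
from sqlalchemy import text

def history_list_length(conn):
    # SHOW ENGINE INNODB STATUS returns one row of (Type, Name, Status);
    # the Status text includes a line like "History list length 1019".
    status = conn.execute(text("SHOW ENGINE INNODB STATUS")).fetchone()
    match = re.search(r"History list length (\d+)", status[2])
    return int(match.group(1)) if match else None
```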

I've pushed the change to the staging server. There is a nightly increase in transaction history there as well, but significantly less than in production, so the effect will be harder to see. An unexpected exception would be a more significant signal, and good to catch and fix before going to production.

On stage, before this change the maximum transaction history length was 1,019, peaking after midnight UTC. After the change, the maximum was 105. There were no unexpected exceptions, so I think this is safe to try in production.

We deployed this to production, and it looks good after one run. In the previous days, the maximum transaction history ranged from 176,000 to 297,000 and stayed high for 6 to 16 hours. In last night's run, the peak was 28,000, and it dropped after four hours.

I'm closing the issue, and will continue to monitor.

Oops, I said I was closing in the comment above, but did not actually close the issue.

This change had an effect on replica lag. Instead of growing during the day and dropping back to 0 seconds at off-peak times, the lag continued to grow each day, requiring manual intervention. Removing half of the async workers appears to fix the issue, with no noticeable change in the total number of observations processed.