mozilla/ichnaea

Use READ COMMITTED isolation during the StatRegion queries

jwhitlock opened this issue · 6 comments

The StatRegion task counts the number of stations per region, as displayed on https://location.services.mozilla.com/stats/regions. It runs one count query per station shard table, such as:

```sql
SELECT wifi_shard_9.region, count(*) AS count_1
FROM wifi_shard_9
WHERE wifi_shard_9.region IS NOT NULL
GROUP BY wifi_shard_9.region
```

These queries can take a long time, even with an index. At the default REPEATABLE READ isolation level, MySQL has to retain undo history for every transaction that commits while the query is running, in order to present a consistent snapshot. The change would be:

  • Use the READ COMMITTED isolation level when gathering the counts (see the sketch after this list).
  • Switch back to the default REPEATABLE READ isolation level to update the region counts tables.
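A minimal sketch of the per-connection isolation change, assuming a SQLAlchemy engine; the `count_regions` helper and the connection URL are illustrative, not the actual ichnaea code:

```python
from sqlalchemy import create_engine, text

# Illustrative engine URL; the real ichnaea settings differ.
engine = create_engine("mysql+pymysql://user:pass@localhost/location")

def count_regions(shard_table):
    # shard_table comes from a fixed list of station tables
    # (e.g. wifi_shard_0 .. wifi_shard_f), never from user input.
    query = text(
        f"SELECT region, COUNT(*) AS count_1 FROM {shard_table} "
        "WHERE region IS NOT NULL GROUP BY region"
    )
    # With READ COMMITTED, InnoDB does not have to hold undo history
    # for a consistent snapshot while the long count runs.
    with engine.connect().execution_options(
        isolation_level="READ COMMITTED"
    ) as conn:
        return dict(conn.execute(query).fetchall())

# Updating the region counts table then happens on a normal connection,
# at the engine's default REPEATABLE READ isolation level.
```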


Starting a new transaction inside the task with REPEATABLE READ isolation is not interacting well with the testing code, which wraps each test in a transaction so that any database writes can be rolled back. The REPEATABLE READ connection sees an empty database from the start of the test rather than the one set up with test rows.

I'm thinking through next steps, but I'm leaning toward faking the isolation level in tests.
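One possible shape for that fake, where every name (the import path, the `read_committed_session` helper, the task entry point) is hypothetical: patch the code that requests READ COMMITTED so the task reuses the test's own transaction-wrapped session and can still see the fixture rows.

```python
from unittest import mock

from ichnaea.data import tasks  # hypothetical import path

def test_stat_region(db_session):
    # Reuse the test's transaction-wrapped session so the task sees
    # the fixture rows, instead of opening a real READ COMMITTED
    # connection that would see an empty database.
    with mock.patch.object(
        tasks, "read_committed_session",  # hypothetical helper name
        return_value=db_session,
    ):
        tasks.stat_region()  # hypothetical entry point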

A bit more info: the 16(?) queries all run serially in a single transaction, so the history list has to track roughly 4 hours of updates.
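For reference, the history list length being discussed can be read from `SHOW ENGINE INNODB STATUS`; a small sketch, where the helper name is mine:

```python
import re
from sqlalchemy import text

def history_list_length(conn):
    # SHOW ENGINE INNODB STATUS returns one row of (Type, Name, Status);
    # the Status text includes a line like "History list length 1019".
    status = conn.execute(text("SHOW ENGINE INNODB STATUS")).fetchone()
    match = re.search(r"History list length (\d+)", status[2])
    return int(match.group(1)) if match else None
```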

I've pushed the change to the staging server. There is a nightly increase in transaction history there as well, but significantly less than in production, so the effect will be harder to see. An unexpected exception would be a more significant signal, and good to catch and fix before going to production.

On stage, before this change the maximum transaction history length was 1,019, peaking after midnight UTC. After the change, the maximum was 105. There were no unexpected exceptions, so I think this is safe to try in production.

We deployed this to production, and it looks good after one run. In the previous days, the maximum transaction history ranged from 176,000 to 297,000 and stayed high for 6 to 16 hours. In last night's run, the peak was 28,000, and it dropped after four hours.

I'm closing the issue, and will continue to monitor.

Oops, I said I was closing in the comment above, but did not actually close the issue.

This change had an effect on replica lag. Instead of growing during the day and dropping back to 0 seconds at off-peak times, the lag continued to grow each day, requiring manual intervention. Removing half of the async workers appears to fix the issue, with no noticeable change in the total number of observations processed.