usgo/agagd

Testing and improving the response time issues that appear in production but not in development

abaisero opened this issue · 2 comments

The response time from the production server is highly variable, in my experience taking anywhere from 2 to 90 seconds. Given the high variability, I assume it's likely that there are multiple contributing factors to this issue. Nonetheless, I think I've found at least one issue which may significantly contribute: the number of queries made to the db can be greatly reduced by multiple factors by changing how the tables are rendered in various pages (e.g., rendering the "player_name (player_id)" values currently results in a separate db query for each row of the table).

However, this issue only appear in production, and not in a local development instance, so it is currently impossible for someone to locally verify any fix. To be clear: I can verify that the number of queries is reduced by a large factor, but I cannot verify that this results in significant real-time gains). This might be because the production db is significantly larger than the development db (which is consistent with the hypothesis that the number of queries contributes to the issue), but there might be other reasons as well.

What would be a good way to replicate in development the timing issues that appear in production? Would it be possible to access a copy of the production db somehow?

I would say this is our number one problem right now. The main person who’s been working on this recently has become unavailable for the foreseeable future, but I can try to make some time this week to be able to look at your work on this and see if we can get it into production.

Confirming the issue and the fix locally would be great, but to be honest I’d be comfortable making fixes to reduce the query reduction and deploying those to live to see how much they improve things. Things that are unambiguous improvements, that don’t require modification of a feature to decrease db load, are basically straight benefit.

Sounds good; in the end I found that creating larger fake fixtures was enough to make the timing issues also appear in development. I just submitted a pull request which reduces the number of queries for some pages by a factor of 100 in the development environment, and supposedly by a lot more in the production environment which uses the real much larger database.