Pub/Sub Bottleneck
pmeenan opened this issue · 1 comments
The testing queue currently uses pub/sub to send work to the test agents and to send retry/fail/crawled URLs back for re-testing.
Pub/sub has some good benefits around re-assigning tasks if an agent dies mid-test and for auto-scaling, but it also has quite a few problems that may be solvable by a better queuing mechanism:
- Test throughput drops off pretty steeply and isn't sustained, likely wasting capacity.
- The tail of the test, when the last few hundred URLs are being tested, can drag out as the instance counts for the agents scale in.
- Task duplicates happen from time to time, so there are protections in place to make sure that tests are not duplicated.
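For illustration, the duplicate protection mentioned in the last bullet can be as simple as remembering which task IDs have already been claimed. This is only a minimal sketch, assuming each task carries a unique ID; the `DedupGuard` class and the `url-N` IDs are hypothetical and not taken from the actual agent code:

```python
class DedupGuard:
    """Tracks task IDs that have already been claimed (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def claim(self, task_id):
        """Return True the first time a task ID is seen, False on any repeat."""
        if task_id in self._seen:
            return False
        self._seen.add(task_id)
        return True


guard = DedupGuard()
deliveries = ["url-1", "url-2", "url-1"]  # "url-1" is delivered twice
# Only the first delivery of each ID is kept.
processed = [task_id for task_id in deliveries if guard.claim(task_id)]
```

With many agents the set of seen IDs would have to live somewhere shared rather than in one process's memory; the in-memory set here just shows the check itself.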
Something like beanstalkd or redis on a central server (or maybe one per region with intelligent task distribution) is likely to be able to scale better. Another option is a small number of stable pub/sub subscriber servers that maintain state and hand work to the agents that poll them.
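As a rough sketch of the behavior a polling queue would need to preserve, here is an in-memory model of reserve/delete with a time-to-run deadline, which is the general pattern beanstalkd uses to hand a job back out when a worker dies. It is not a beanstalkd or redis client; the `ReservingQueue` class, its method names, the 60-second timeout and the example URLs are all assumptions for illustration:

```python
import time
from collections import deque


class ReservingQueue:
    """In-memory model of a queue where agents reserve, then delete, jobs."""

    def __init__(self, ttr_seconds, clock=time.monotonic):
        self._ttr = ttr_seconds      # how long an agent may hold a job
        self._clock = clock          # injectable so the example can fake time
        self._ready = deque()        # jobs waiting for an agent
        self._reserved = {}          # job -> deadline (jobs must be hashable)

    def put(self, job):
        self._ready.append(job)

    def _requeue_expired(self):
        """Move jobs whose deadline has passed back to the ready list."""
        now = self._clock()
        expired = [job for job, deadline in self._reserved.items() if deadline <= now]
        for job in expired:
            del self._reserved[job]
            self._ready.append(job)

    def reserve(self):
        """Hand the next ready job to a polling agent, or None if there is none."""
        self._requeue_expired()
        if not self._ready:
            return None
        job = self._ready.popleft()
        self._reserved[job] = self._clock() + self._ttr
        return job

    def delete(self, job):
        """Called by an agent when it has finished the job."""
        self._reserved.pop(job, None)


now = [0.0]  # fake clock so the timeout can be shown without sleeping
queue = ReservingQueue(ttr_seconds=60, clock=lambda: now[0])
queue.put("https://example.com/a")
queue.put("https://example.com/b")

first = queue.reserve()    # agent 1 takes the first URL...
queue.delete(first)        # ...and finishes it
second = queue.reserve()   # agent 2 takes the second URL, then dies
now[0] += 61               # the time-to-run deadline passes
retried = queue.reserve()  # the abandoned URL is handed out again
```

Because agents pull work only when they are idle, a queue like this keeps the re-assignment benefit of pub/sub without pushing tasks to instances that are about to scale in.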
It's not a short-term problem but is likely wasting some testing resources by not utilizing them fully.
We just switched to using beanstalkd for the queue management and it is working much better.