Pub/Sub Bottleneck

Question


pmeenan opened this issue 3 years ago · 1 comments

The testing queue currently uses pub/sub to send work to the test agents and to send retried, failed, and crawled URLs back for re-testing.

Pub/sub has real benefits: it re-assigns tasks if an agent dies mid-test, and it works well with auto-scaling. It also has quite a few problems that a better queuing mechanism may be able to solve:

Test throughput drops off steeply and is not sustained, which likely wastes capacity.
The tail of a test run, when the last few hundred URLs are being tested, can drag out as the agent instance counts scale in.
Duplicate tasks are delivered from time to time, so protections are in place to make sure that tests are not run twice.
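
To illustrate the duplicate-task protection mentioned above, here is a minimal sketch of a guard that runs a test only the first time its task ID is seen. This is not the project's actual code: `DedupeGuard`, `handle_delivery`, and the idea of keying on a string task ID are assumptions made for illustration, and a real deployment would need shared storage rather than an in-process set.

```python
import threading


class DedupeGuard:
    """Hypothetical helper: remembers task IDs that were already claimed."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, task_id: str) -> bool:
        """Return True the first time task_id is seen, False on repeats."""
        with self._lock:
            if task_id in self._seen:
                return False
            self._seen.add(task_id)
            return True


def handle_delivery(guard: DedupeGuard, task_id: str, run_test) -> bool:
    """Run the test only if this delivery is not a duplicate."""
    if not guard.claim(task_id):
        # Duplicate delivery: skip the work (the message would still be acked).
        return False
    run_test(task_id)
    return True
```

With this shape, a second delivery of the same task ID returns `False` without calling `run_test`, so the agent can acknowledge the message and move on.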

Something like beanstalkd or Redis on a central server (or maybe one per region, with intelligent task distribution) is likely to scale better. Another option is a small number of stable pub/sub subscriber servers that maintain state and hand work to the agents that poll them.
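
As a rough sketch of how a polled central queue can keep the "re-assign if an agent dies" property, the in-memory model below mimics the reserve/delete pattern with a time-to-run lease that beanstalkd-style queues use. It is not a beanstalkd client and not the project's implementation: `LeaseQueue`, its method names, and the explicit `now` argument (used here so the behavior is deterministic) are all assumptions for illustration.

```python
import itertools
from collections import deque


class LeaseQueue:
    """Hypothetical in-memory queue with time-limited reservations."""

    def __init__(self, ttr_seconds: float):
        self.ttr = ttr_seconds          # how long an agent may hold a job
        self._ready = deque()           # (job_id, payload) waiting for an agent
        self._reserved = {}             # job_id -> (deadline, payload)
        self._ids = itertools.count(1)

    def put(self, payload):
        """Add a job to the ready list and return its ID."""
        job_id = next(self._ids)
        self._ready.append((job_id, payload))
        return job_id

    def _expire(self, now: float):
        """Return jobs whose lease has run out to the ready list."""
        expired = [j for j, (deadline, _) in self._reserved.items() if deadline <= now]
        for job_id in expired:
            _, payload = self._reserved.pop(job_id)
            self._ready.append((job_id, payload))

    def reserve(self, now: float):
        """Hand the next ready job to a polling agent, or None if idle."""
        self._expire(now)
        if not self._ready:
            return None
        job_id, payload = self._ready.popleft()
        self._reserved[job_id] = (now + self.ttr, payload)
        return job_id, payload

    def delete(self, job_id) -> bool:
        """Mark a reserved job as finished; False if it was not reserved."""
        return self._reserved.pop(job_id, None) is not None
```

An agent that reserves a job and then dies never calls `delete`, so the job becomes available to the next agent that polls after the lease expires. Agents that finish normally call `delete` and the job is gone for good.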

It's not a problem that needs solving in the short term, but it is likely wasting some testing resources by not fully utilizing them.

Answer 1 · 2024-04-08T19:33:35.000Z

We just switched to using beanstalkd for the queue management and it is working much better.