HTTPArchive/data-pipeline

Pub/Sub Bottleneck

pmeenan opened this issue · 1 comments

The testing queue currently uses pub/sub to send work to the test agents and to send retry/fail/crawled urls back for re-testing.

Pub/sub has some good benefits around re-assigning tasks if an agent dies mid-test and for auto-scaling, but it also has quite a few problems that may be solvable by a better queuing mechanism:

  • The test throughput drops pretty steeply and isn't sustained, likely wasting capacity.
  • The tail of the test, when the last few hundred URLs are being tested, can drag out as the instance counts for the agents scale in.
  • Task duplicates happen from time to time, so there are protections in place to make sure that tests are not duplicated.
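
For illustration only, here is a minimal sketch of the kind of duplicate protection mentioned in the last bullet. The issue doesn't describe how the actual protections work, so the `DuplicateGuard` class, the task IDs, and the in-memory set are all assumptions; a real pipeline would need shared, persistent state instead.

```python
class DuplicateGuard:
    """Tracks task IDs that have already been accepted for testing.

    Hypothetical helper: state lives in this process only.
    """

    def __init__(self):
        self._seen = set()

    def accept(self, task_id: str) -> bool:
        """Return True the first time a task ID is seen, False on redelivery."""
        if task_id in self._seen:
            return False
        self._seen.add(task_id)
        return True


guard = DuplicateGuard()
deliveries = ["page-1", "page-2", "page-1"]  # "page-1" is delivered twice
accepted = [task_id for task_id in deliveries if guard.accept(task_id)]
```

With this guard in front of the agent, the second delivery of `page-1` is dropped instead of being tested again.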

Something like beanstalkd or redis on a central server (or maybe one per region with intelligent task distribution) is likely to be able to scale better. Another option is a small number of stable pub/sub subscriber servers that maintain state and hand work to the agents that poll them.
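
To make the polling idea concrete, here is a rough in-memory sketch of a central queue that agents poll, loosely modeled on the reserve/delete and time-to-run idea in beanstalkd. It is not the beanstalkd API or the pipeline's code: `LeaseQueue`, `lease_seconds`, and the explicit `now` argument are hypothetical names chosen for the example. A reserved job whose lease expires goes back to the ready list, which keeps the "re-assign if an agent dies mid-test" benefit without pub/sub.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class LeaseQueue:
    """Hypothetical central work queue with time-limited reservations."""

    def __init__(self, lease_seconds: float):
        self.lease_seconds = lease_seconds
        self._ready: Deque[Tuple[int, str]] = deque()
        self._reserved: Dict[int, Tuple[str, float]] = {}  # job_id -> (url, deadline)
        self._next_id = 0

    def put(self, url: str) -> int:
        """Add a URL to the queue and return its job ID."""
        job_id = self._next_id
        self._next_id += 1
        self._ready.append((job_id, url))
        return job_id

    def _requeue_expired(self, now: float) -> None:
        """Move jobs whose lease has run out back to the ready list."""
        expired = [jid for jid, (_, deadline) in self._reserved.items() if deadline <= now]
        for jid in expired:
            url, _ = self._reserved.pop(jid)
            self._ready.append((jid, url))

    def reserve(self, now: float) -> Optional[Tuple[int, str]]:
        """Called by a polling agent; returns (job_id, url) or None if idle."""
        self._requeue_expired(now)
        if not self._ready:
            return None
        job_id, url = self._ready.popleft()
        self._reserved[job_id] = (url, now + self.lease_seconds)
        return job_id, url

    def delete(self, job_id: int) -> bool:
        """Called by an agent when a test finishes; True if the job was reserved."""
        return self._reserved.pop(job_id, None) is not None


queue = LeaseQueue(lease_seconds=60)
queue.put("https://example.com/a")
queue.put("https://example.com/b")

first = queue.reserve(now=0)     # agent 1 takes job 0
second = queue.reserve(now=1)    # agent 2 takes job 1
queue.delete(second[0])          # agent 2 finishes its test
idle = queue.reserve(now=30)     # nothing ready; the lease on job 0 is still active
retried = queue.reserve(now=61)  # agent 1 never finished, so job 0 is handed out again
```

Passing `now` explicitly keeps the sketch deterministic; a real server would read a clock and would also need persistence and network access for the agents.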

It's not a short-term problem, but it is likely wasting some testing resources by not utilizing them fully.

We just switched to using beanstalkd for the queue management and it is working much better.