HTTPArchive/legacy.httparchive.org

Come up with plans to increase capacity

rviscomi opened this issue · 2 comments

The crawl is uncomfortably close to exceeding its 14-day window. To provide more headroom and enable 10x growth down the line, we need to reevaluate the current crawl configuration and hardware capabilities.

To do:

  • identify easy short-term solutions for gaining headroom
  • come up with a long-term strategy for increasing capacity 10x

Update: We've tripled capacity to 1.3M URLs, but we're still running into some issues with hardware failures. We're also limited in how much we can do with Lighthouse, e.g. unused JS detection requires multiple passes, and there's no desktop auditing yet. Leaving this issue open for the next sync.

Capacity is now ~10x after reducing to monthly crawls and 1 run per test. See https://discuss.httparchive.org/t/changes-to-the-http-archive-corpus/1539 for more info.
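
For anyone revisiting the numbers, here is a rough sketch of how the two levers above (a longer crawl window and fewer runs per test) multiply capacity under a simple throughput model. The function names, the throughput figure, the 30-day window, and the "3 runs per test" starting point are all assumptions for illustration, not values taken from the crawl configuration, and the result is not meant to reconstruct the ~10x figure.

```python
def crawl_capacity(test_runs_per_day: float, window_days: float, runs_per_test: int) -> float:
    """URLs that fit in one crawl window, assuming constant throughput.

    test_runs_per_day is a hypothetical fleet-wide throughput figure.
    """
    if runs_per_test < 1:
        raise ValueError("runs_per_test must be at least 1")
    return test_runs_per_day * window_days / runs_per_test


def capacity_gain(old_window_days: float, new_window_days: float,
                  old_runs_per_test: int, new_runs_per_test: int) -> float:
    """Multiplier on URL capacity from changing the window and runs per test."""
    return (new_window_days / old_window_days) * (old_runs_per_test / new_runs_per_test)


if __name__ == "__main__":
    # Hypothetical inputs: a 14-day window stretched to 30 days,
    # and an assumed 3 runs per test reduced to 1.
    gain = capacity_gain(14, 30, 3, 1)
    print(f"capacity multiplier: {gain:.1f}x")
```

The model ignores hardware failures and retries, so real headroom would be lower than what it reports.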