Backoff messages
Getting lots of backoff messages which result in sleeps. How does the throttling setting in Fetcher affect this?
WARN [pool-2-thread-3] 10:01:13,678 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#2) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764501555.34/warc/CC-MAIN-20230209081052-20230209111052-00825.warc.gz. Will sleep 120 seconds. Message: bad status code: 503 ::
SlowDown
Please reduce your request rate.C6X3AAMQA0VP9EPCx7TfH5kVcwDGJGNw4rwLR7gptqe/Nwh2MpYEfxIOrWt3czhmG3YU7Oa8oAF7EPxPvZAOVssap9ZEK8hV9vT6rQ==.
WARN [pool-2-thread-2] 10:02:54,901 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#3) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764500641.25/warc/CC-MAIN-20230207201702-20230207231702-00340.warc.gz. Will sleep 600 seconds. Message: bad status code: 503 ::
SlowDown
Please reduce your request rate.VPQZDEX9K6R69M1EkZ65q9fOau4XFXEiNqyCaMAViPP0TDWFo6PYSQ6udIWYKH0z41XNE1IuOZTBSKY0RBxP1cOd/Fg=.
WARN [pool-2-thread-3] 10:03:14,058 org.tallison.cc.index.io.BackoffHttpFetcher got backoff warning (#1) for https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-06/segments/1674764494976.72/warc/CC-MAIN-20230127101040-20230127131040-00572.warc.gz. Will sleep 30 seconds. Message: bad status code: 503 ::
SlowDown
Please reduce your request rate.P9TA1QBQZ93ZQV7ZV2SClLNr9PzQ7Dvdmsvs0aDhxPzD8Bb3HXyYRX+NVR9GTwCMs6ts2ASIRPU5nRmgdxE9Dum+zSM=.
Yes, backoff is configurable... or should be. See the fetcher's throttleSeconds setting
here: https://github.com/tballison/commoncrawl-fetcher-lite/blob/main/examples/default-config.json
You can set longer backoffs to avoid the messages, but then you'll be backing off longer. So, there's a balancing act...
Every LLM is now pulling from CommonCrawl, so its servers are throttling pretty aggressively at the moment. If you're actually on an AWS EC2 instance and you're pulling from S3, the need to throttle evaporates. Well, when it works at all, which is most of the time.
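
To make the balancing act concrete, here is a minimal Java sketch of an escalating backoff schedule. It is not the actual BackoffHttpFetcher code. The class and method names are hypothetical, and it assumes throttleSeconds is an ordered list of sleep times, one per consecutive backoff warning. The 30/120/600 values are taken from the log above. Giving up once the list is exhausted is also an assumption.

```java
import java.util.List;

/** Hypothetical sketch of an escalating backoff schedule; not the project's actual code. */
class BackoffSketch {

    /**
     * Returns how many seconds to sleep after the given consecutive backoff
     * warning (1-based), or -1 if the schedule is exhausted.
     * Assumption: throttleSeconds is an ordered list of sleep times.
     */
    static long sleepSecondsFor(int warningNumber, List<Long> throttleSeconds) {
        if (warningNumber < 1) {
            throw new IllegalArgumentException("warningNumber must be >= 1");
        }
        if (warningNumber > throttleSeconds.size()) {
            return -1; // assumption: caller gives up on this URL
        }
        return throttleSeconds.get(warningNumber - 1);
    }

    /** Worst-case total sleep for one URL if every step of the schedule is used. */
    static long totalSleepSeconds(List<Long> throttleSeconds) {
        long total = 0;
        for (long seconds : throttleSeconds) {
            total += seconds;
        }
        return total;
    }

    public static void main(String[] args) {
        // Sleep times seen in the log above for warnings #1, #2 and #3.
        List<Long> schedule = List.of(30L, 120L, 600L);
        for (int warning = 1; warning <= schedule.size() + 1; warning++) {
            long sleep = sleepSecondsFor(warning, schedule);
            System.out.println("warning #" + warning + " -> "
                    + (sleep < 0 ? "give up" : "sleep " + sleep + " seconds"));
        }
        System.out.println("worst-case sleep per URL: "
                + totalSleepSeconds(schedule) + " seconds");
    }
}
```

With the schedule from the log, a URL that hits all three warnings spends 750 seconds sleeping. Longer entries should mean fewer warnings, but each one costs more wall-clock time.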