CryShana/CryCrawler

[request] Time pause between sending requests to a site.

Closed this issue · 2 comments

Some sites limit the number of requests allowed within a period of time, or simply cannot withstand constant requests. I think we need an option to throttle the request rate; MaxConcurrency only limits the number of simultaneous requests.

I haven't extensively tested this yet, but right now it works like this:

A crawl-delay is set automatically if it's defined in the website's "robots.txt" file. You can't set it manually yet, but I could probably implement this.
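For reference, reading a crawl-delay out of robots.txt can be sketched with Python's standard library (CryCrawler itself is C#, so this only illustrates the idea, not the actual implementation):

```python
# Parse a robots.txt body and extract the crawl-delay for a given user agent.
from urllib.robotparser import RobotFileParser

def get_crawl_delay(robots_txt: str, user_agent: str = "*"):
    """Return the Crawl-delay (in seconds) for the agent, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)

# A site declaring "Crawl-delay: 10" yields a 10-second delay between requests.
delay = get_crawl_delay("User-agent: *\nCrawl-delay: 10\n")
```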

If a request to a website fails with one of the following HTTP status codes: BadGateway, TooManyRequests, or InternalServerError, it will be retried after 5 minutes; if it fails again, that delay is doubled.
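The retry policy described above is plain exponential backoff; a minimal sketch (helper names here are mine, not CryCrawler's):

```python
# Status codes that trigger a delayed retry, per the description above:
# 502 BadGateway, 429 TooManyRequests, 500 InternalServerError.
RETRY_CODES = {429, 500, 502}

INITIAL_RETRY_DELAY = 5 * 60  # first retry after 5 minutes

def retry_delay(failure_count: int) -> int:
    """Seconds to wait before the next retry: 5 minutes, doubled per extra failure."""
    return INITIAL_RETRY_DELAY * (2 ** (failure_count - 1))
```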

If multiple requests fail for a certain domain, the entire domain is blocked for some time before retrying.
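The per-domain blocking can be sketched like this; the threshold and block duration are illustrative assumptions, not CryCrawler's actual values:

```python
FAILURE_THRESHOLD = 3       # consecutive failures before a domain is blocked (assumed)
DOMAIN_BLOCK_SECONDS = 600  # how long the domain stays blocked (assumed)

class DomainBlocker:
    """Tracks consecutive failures per domain and blocks the whole domain."""

    def __init__(self):
        self.failures = {}       # domain -> consecutive failure count
        self.blocked_until = {}  # domain -> time (seconds) when the block expires

    def record_failure(self, domain: str, now: float) -> None:
        self.failures[domain] = self.failures.get(domain, 0) + 1
        if self.failures[domain] >= FAILURE_THRESHOLD:
            self.blocked_until[domain] = now + DOMAIN_BLOCK_SECONDS
            self.failures[domain] = 0  # reset once the block is in place

    def is_blocked(self, domain: str, now: float) -> bool:
        return now < self.blocked_until.get(domain, 0)
```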

I will probably add an option to define a manual crawl-delay in seconds.

I have now fixed the version issues and implemented a global crawl delay option in config.json.
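A global crawl delay in config.json might look something like the snippet below; the key name here is only a guess, so check the sample configuration shipped with the release for the actual name:

```json
{
  "CrawlDelaySeconds": 2
}
```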

https://github.com/CryShana/CryCrawler/releases/tag/v1.0.5