Bug: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon
opendeluxe opened this issue · 1 comments
CheerioCrawler stops crawling when maxRequestsPerCrawl is set to 1. Even when I set maxRequestsPerCrawl to 10 or 100, nothing is crawled anymore after the 10th or 100th request.

I use a new crawler instance for every single request; no parallel requests are necessary in my use cases. However, the limit is counted globally: once the total number of requests reaches maxRequestsPerCrawl, all further requests are denied, regardless of whether I use a new crawler instance per request or a shared one. The only workaround is to shut down the whole process and start it again.
Log:
INFO CheerioCrawler: Starting the crawl
INFO CheerioCrawler: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon. Requests that are in progress will be allowed to finish.
INFO CheerioCrawler: Earlier, the crawler reached the maxRequestsPerCrawl limit of 1 requests and all requests that were in progress at that time have now finished. In total, the crawler processed 1 requests and will shut down.
INFO CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":190}
My Code:
const crawler = new CheerioCrawler({
    minConcurrency: 1,
    maxConcurrency: 1,
    // proxyConfiguration: {},
    // On error, retry each page at most once.
    maxRequestRetries: 1,
    // Increase the timeout for processing of each page.
    requestHandlerTimeoutSecs: 30,
    // Limit to 1 request per crawl.
    maxRequestsPerCrawl: 1,
    async requestHandler({ request, $, proxyInfo }) {
        // ...
    },
});
await crawler.run([url]);
await crawler.teardown();
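One possible workaround, sketched below under the assumption that the crawler in question is Crawlee's CheerioCrawler (which persists its request queue in default storage across crawler instances): open a fresh, uniquely named RequestQueue for each crawl so the handled-request count does not carry over from earlier runs. The function name crawlOnce and the queue-naming scheme are illustrative, not part of the original report.

```javascript
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Sketch: give each crawl its own request queue so that the
// maxRequestsPerCrawl counter starts from zero every time,
// instead of accumulating in the shared default queue.
async function crawlOnce(url) {
    // The queue name is illustrative; any unique name works.
    const requestQueue = await RequestQueue.open(`crawl-${Date.now()}`);

    const crawler = new CheerioCrawler({
        requestQueue,
        maxConcurrency: 1,
        maxRequestsPerCrawl: 1,
        async requestHandler({ request, $ }) {
            // ...
        },
    });

    await crawler.run([url]);
    await crawler.teardown();

    // Drop the queue so uniquely named queues do not pile up in storage.
    await requestQueue.drop();
}
```

Whether this resolves the reported behavior depends on whether the counter is backed by the request queue; if it is persisted elsewhere, purging the default storage between runs would be the alternative to try.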
You seem to be using a library on top of cheerio that is causing these issues. (Cheerio itself does not include a crawler.)