cheeriojs/cheerio

Bug: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon

opendeluxe opened this issue · 1 comment

CheerioCrawler does not crawl anything when maxRequestsPerCrawl is set to 1.
Even when I set maxRequestsPerCrawl to 10 or 100, nothing is crawled anymore after the 10th or 100th request.

I use a new crawler instance for every single request; no parallel requests are necessary in my use cases.
However, requests are counted on a global basis, regardless of whether I use a new instance for every request or a shared one.

Once the total count of all requests reaches the value of maxRequestsPerCrawl, all further requests are denied. The only workaround is to shut down the whole process and start it again.
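To illustrate the behavior I am seeing, here is a minimal, self-contained sketch (this is not Crawlee's actual implementation, just a model of the observed symptom): the handled-request count effectively lives in shared state, so a brand-new crawler instance still sees the total from earlier crawls.

```javascript
// Stand-in for the shared/persisted storage that survives across instances.
const sharedState = { handledCount: 0 };

class FakeCrawler {
  constructor({ maxRequestsPerCrawl }) {
    this.maxRequestsPerCrawl = maxRequestsPerCrawl;
  }

  run(urls) {
    const results = [];
    for (const url of urls) {
      // The limit is checked against the SHARED count, not a per-instance one.
      if (sharedState.handledCount >= this.maxRequestsPerCrawl) {
        results.push(`denied: ${url}`);
        continue;
      }
      sharedState.handledCount += 1;
      results.push(`crawled: ${url}`);
    }
    return results;
  }
}

// First instance crawls one URL, then hits the limit.
const first = new FakeCrawler({ maxRequestsPerCrawl: 1 });
console.log(first.run(['https://example.com/a']));

// A brand-new instance is still blocked, because the count is shared.
const second = new FakeCrawler({ maxRequestsPerCrawl: 1 });
console.log(second.run(['https://example.com/b']));
```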

Log:

INFO  CheerioCrawler: Starting the crawl
INFO  CheerioCrawler: Crawler reached the maxRequestsPerCrawl limit of 1 requests and will shut down soon. Requests that are in progress will be allowed to finish.
INFO  CheerioCrawler: Earlier, the crawler reached the maxRequestsPerCrawl limit of 1 requests and all requests that were in progress at that time have now finished. In total, the crawler processed 1 requests and will shut down.
INFO  CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":0,"retryHistogram":[],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":0,"requestTotalDurationMillis":0,"requestsTotal":0,"crawlerRuntimeMillis":190}

My Code:

    const crawler = new CheerioCrawler({
      minConcurrency: 1,
      maxConcurrency: 1,

      // proxyConfiguration: {},

      // On error, retry each page at most once.
      maxRequestRetries: 1,

      // Increase the timeout for processing of each page.
      requestHandlerTimeoutSecs: 30,

      // Limit the crawl to a single request.
      maxRequestsPerCrawl: 1,

      async requestHandler({ request, $, proxyInfo }) {
        // ...
      },
    });

    await crawler.run([url]);
    await crawler.teardown();
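A possible workaround I am considering is to give each crawl its own, uniquely named request queue so that no state is shared with earlier crawls. This is only a sketch assuming Crawlee's CheerioCrawler and RequestQueue APIs; the queue name and the crawlOnce helper are my own, so please verify against the Crawlee docs for your version:

```javascript
// Sketch of a per-crawl request queue, assuming Crawlee's API.
import { CheerioCrawler, RequestQueue } from 'crawlee';

async function crawlOnce(url) {
  // Open a uniquely named queue so state is not shared with earlier crawls.
  const requestQueue = await RequestQueue.open(`crawl-${Date.now()}`);

  const crawler = new CheerioCrawler({
    requestQueue,
    maxRequestsPerCrawl: 1,
    async requestHandler({ request, $ }) {
      console.log(`${request.url}: ${$('title').text()}`);
    },
  });

  await crawler.run([url]);
  await crawler.teardown();
  await requestQueue.drop(); // discard the per-crawl queue and its counters
}
```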
fb55 commented

You seem to be using a library on top of cheerio that is causing these issues. (Cheerio itself does not include a crawler.)