apify/crawlee

Cannot use multiple 'PlaywrightCrawlers' simultaneously

Closed this issue · 1 comment

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

When multiple PlaywrightCrawler instances run at the same time, one crawler finishing its task causes the entire process to exit, terminating the other crawlers as well.

Code sample

const { PlaywrightCrawler } = require('crawlee');

// Define the crawler configuration
const crawlerConfig = {
    // crawler options...
};

// Create and start the first crawler instance
const crawler1 = new PlaywrightCrawler(crawlerConfig);
crawler1.run(["https://amazon.com"]);

// Create and start the second crawler instance
const crawler2 = new PlaywrightCrawler(crawlerConfig);
crawler2.run(["https://amazon.com"]);

Package version

3.10.5

Node.js version

v20.13.1

Operating system

Windows

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

I think it may be caused by the SIGINT handler registration:

process.once('SIGINT', sigintHandler);
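
For illustration, here is a standalone Node sketch (hypothetical handler names, not Crawlee's actual shutdown code) of what happens when two crawlers each register a one-shot SIGINT listener on the shared process object:

// Hypothetical standalone sketch, not Crawlee internals.
const sigintHandler1 = () => console.log('crawler1: aborting run...');
const sigintHandler2 = () => console.log('crawler2: aborting run...');

process.once('SIGINT', sigintHandler1);
process.once('SIGINT', sigintHandler2);

// The first Ctrl+C invokes both listeners (each exactly once) and removes them.
// A second SIGINT then falls through to Node's default behavior and terminates
// the process, regardless of whether either crawl has finished.
setInterval(() => {}, 1000); // keep the event loop alive for the demo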

This might be caused by both crawlers sharing the same storage. You can tell Crawlee to use a separate storage backend for each crawler by supplying the optional second constructor parameter.

const { CheerioCrawler, Configuration } = require('crawlee');

const crawler = new CheerioCrawler(
    {
        ...crawlerOptions,
    },
    // The optional second argument: a per-crawler Configuration
    new Configuration({
        persistStorage: false,
    }),
);

In the Configuration, you can:

  • Set up a memory-only crawl with persistStorage: false (this also stops the crawlers from sharing storage, since each one then uses a separate in-memory storage backend; see the sketch after this list).
  • Use storageClientOptions.localDataDirectory to point each crawler at its own data directory.
  • Use different default(Dataset|KeyValueStore|RequestQueue)Id options to store each crawler's data in a separate dataset / key-value store / request queue.
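
For example, here is a minimal sketch of the reported setup with a separate in-memory storage backend per crawler (the requestHandler is a placeholder assumption, not taken from the original report):

const { PlaywrightCrawler, Configuration } = require('crawlee');

// Placeholder options; a real crawler would define its full configuration here.
const crawlerConfig = {
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    },
};

// Giving each crawler its own Configuration with persistStorage: false
// means they no longer share a storage backend.
const crawler1 = new PlaywrightCrawler(
    crawlerConfig,
    new Configuration({ persistStorage: false }),
);
const crawler2 = new PlaywrightCrawler(
    crawlerConfig,
    new Configuration({ persistStorage: false }),
);

// Run both crawls concurrently; one finishing no longer ends the other.
(async () => {
    await Promise.all([
        crawler1.run(['https://amazon.com']),
        crawler2.run(['https://amazon.com']),
    ]);
})();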

Let us know whether this helped. Cheers!