website-scraper/website-scraper-puppeteer

What's the best approach to speed things up?


Hi,

I'm looking for ways to speed up the crawling process.
website-scraper takes up to 8 minutes to crawl a site, whereas website-scraper-puppeteer needs 40 minutes for the same site. (Sure, I expect some penalty.)

Increasing CPU resources only helps up to a certain point, as I see that website-scraper-puppeteer (Chromium) does not use all of the available CPU.

Would it be possible to use website-scraper-puppeteer only for JS scraping/execution and leave the rest to website-scraper? If so, how?

While looking at puppeteer and options for improving performance, I came across puppeteer-pool (https://github.com/latesh/puppeteer-pool).
Would it be possible to use puppeteer-pool from within website-scraper-puppeteer? If so, how?

I tried the last option, but I'm not very experienced with Node, so I failed to get it to work.

Gr, J

Hi @jalbstmeijer

Sorry for the late response.

website-scraper-puppeteer is used only for JS execution on HTML pages. Other resources (images, styles, etc.) are downloaded by the default functionality in website-scraper, without puppeteer.

I think it's possible to improve performance by using a pool for puppeteer, but that requires some development. Unfortunately, I do not have time for this.
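
To illustrate the pooling idea in general terms, here is a minimal sketch of a fixed-size resource pool in plain JavaScript. It is not the API of website-scraper-puppeteer or puppeteer-pool: the `ResourcePool` class, the `demo` function, and the fake "page" objects are hypothetical names made up for this example. In a real setup the factory would presumably open a browser page (for instance via Puppeteer's `browser.newPage()`), which is an assumption and not something the plugin does today.

```javascript
// Hypothetical fixed-size pool: at most `size` resources are ever created,
// and callers wait for a free one instead of creating more.
class ResourcePool {
  constructor(create, size) {
    this.create = create; // async factory; assumed to be e.g. () => browser.newPage()
    this.size = size;     // upper bound on resources created
    this.created = 0;     // resources created so far
    this.idle = [];       // resources that are free for reuse
    this.waiters = [];    // resolve callbacks of callers waiting for a resource
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    if (this.created < this.size) {
      this.created += 1;
      try {
        return await this.create();
      } catch (err) {
        this.created -= 1; // give the slot back if creation failed
        throw err;
      }
    }
    // Pool exhausted: wait until release() hands a resource over.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(resource) {
    const next = this.waiters.shift();
    if (next) next(resource);
    else this.idle.push(resource);
  }

  // Run `task` with a pooled resource and always return the resource afterwards.
  async use(task) {
    const resource = await this.acquire();
    try {
      return await task(resource);
    } finally {
      this.release(resource);
    }
  }
}

// Demo with fake pages (plain objects) standing in for browser pages.
async function demo() {
  let made = 0;   // how many fake pages were created
  let active = 0; // tasks currently holding a page
  let peak = 0;   // highest number of tasks that held a page at once
  const pool = new ResourcePool(async () => ({ id: ++made }), 2);
  const urls = ['a', 'b', 'c', 'd', 'e', 'f']; // placeholder inputs

  const results = await Promise.all(
    urls.map((url) =>
      pool.use(async (page) => {
        active += 1;
        peak = Math.max(peak, active);
        await new Promise((resolve) => setTimeout(resolve, 10)); // simulated rendering work
        active -= 1;
        return `${url}:${page.id}`;
      })
    )
  );
  return { made, peak, results };
}

demo().then((summary) => console.log(summary));
```

The point of the design is that expensive resources are reused rather than created per page, while the pool size caps how many render at the same time. Whether this helps in practice would depend on how the plugin opens and closes pages, which is the development work mentioned above.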

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.