yujiosaka/headless-chrome-crawler

Get current URL in customCrawl()

popstas opened this issue · 3 comments

What is the current behavior?
No information about current URL in customCrawl()

What is the motivation / use case for changing the behavior?
I want to skip the request but still add the URL to the CSV for certain file types like zip, doc, and pdf.
My code that do it - https://github.com/viasite/sites-scraper/blob/59449b1b03/src/scrap-site.js#L240-L255

Proposal
Add crawler to customCrawl:
customCrawl: async (page, crawl, crawler)

I tried to store the current URL with the requeststarted event, but it fails when concurrency > 1.

What do you think about it? I can make PR.

Hey @popstas
This is a valid proposal. I had the same issue, so yes, please do the PR, and please don't forget to add the related info to the docs. It's been a while since you posted this, so let me know if you're still willing to do it.

We can use the preRequest option to skip URLs. We can persist the URL or do anything else with it there.
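A minimal sketch of that idea, assuming the standard preRequest contract (return false to skip the request). The isBinaryAsset helper, the extension list, and the saveUrlToCsv callback are my own assumptions, not part of HCCrawler:

```javascript
// Hypothetical helper: decide whether a URL points to a file we want to
// record but not crawl.
const BINARY_EXTENSIONS = ['.zip', '.doc', '.docx', '.pdf'];

function isBinaryAsset(url) {
  const pathname = new URL(url).pathname.toLowerCase();
  return BINARY_EXTENSIONS.some(ext => pathname.endsWith(ext));
}

// Sketch of crawler options using preRequest. `saveUrlToCsv` is a
// hypothetical callback you would supply yourself.
const crawlerOptions = {
  preRequest: options => {
    if (isBinaryAsset(options.url)) {
      // saveUrlToCsv(options.url); // record the URL, then skip the request
      return false; // returning false tells the crawler to skip this URL
    }
    return true;
  },
};
```

You would pass these options to HCCrawler.launch along with your other settings.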

2 years since the issue was opened, but if others in the future are looking to get the current URL, it's available in the result object of a customCrawl. Specifically result.options.url. Something like this should do the trick:

customCrawl: async (page, crawl) => {
    await page.setRequestInterception(true);
    page.on('request', request => request.continue());
    page.on('error', err => console.error(err));

    const result = await crawl();
    const currentUrl = result.options.url;
    // ... whatever logic you want
    return result;
}
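Building on that, here is a sketch of how result.options.url could serve the original use case (record zip/doc/pdf URLs without processing them). The extension regex, the appendToCsv callback, and the result.skipped flag are assumptions for illustration, not HCCrawler API:

```javascript
// Sketch of a customCrawl that records binary file URLs instead of
// processing them. `appendToCsv` is a hypothetical callback; `skipped`
// is a made-up flag for your own downstream handling.
const customCrawl = async (page, crawl) => {
  await page.setRequestInterception(true);
  page.on('request', request => request.continue());
  page.on('error', err => console.error(err));

  const result = await crawl();
  const currentUrl = result.options.url;
  if (/\.(zip|docx?|pdf)$/i.test(new URL(currentUrl).pathname)) {
    // appendToCsv(currentUrl); // record the URL for the CSV export
    result.skipped = true;      // mark it so later steps can ignore it
  }
  return result;
};
```

In a test or dry run you can exercise it with stub page and crawl arguments, since only result.options.url is inspected.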