yujiosaka/headless-chrome-crawler

Get current URL in customCrawl()

popstas opened this issue · 3 comments

What is the current behavior?
No information about current URL in customCrawl()

What is the motivation / use case for changing the behavior?
I want to skip the request but still add the URL to the CSV for certain file types like zip, doc, and pdf.
My code that do it - https://github.com/viasite/sites-scraper/blob/59449b1b03/src/scrap-site.js#L240-L255

Proposal
Add crawler to customCrawl:
customCrawl: async (page, crawl, crawler)

I tried to store the current URL with the requeststarted event, but it fails when concurrency > 1.

What do you think about it? I can make PR.

Hey @popstas
This is a valid proposal. I had the same issue, so yes, please do the PR, and please don't forget to add the related info to the docs. It's been a while since you posted this, so let me know if you're still willing to do it.

We can use the preRequest option to skip URLs. We can persist the URL or do anything else with it there.
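A minimal sketch of that idea, assuming the standard preRequest contract (return false to skip the request). The isBinaryAsset helper, the extension list, and the saveUrlToCsv callback are my own assumptions, not part of HCCrawler:

```javascript
// Hypothetical helper: decide whether a URL points to a file we want to
// record but not crawl.
const BINARY_EXTENSIONS = ['.zip', '.doc', '.docx', '.pdf'];

function isBinaryAsset(url) {
  const pathname = new URL(url).pathname.toLowerCase();
  return BINARY_EXTENSIONS.some(ext => pathname.endsWith(ext));
}

// Sketch of crawler options using preRequest. `saveUrlToCsv` is a
// hypothetical callback you would supply yourself.
const crawlerOptions = {
  preRequest: options => {
    if (isBinaryAsset(options.url)) {
      // saveUrlToCsv(options.url); // record the URL, then skip the request
      return false; // returning false tells the crawler to skip this URL
    }
    return true;
  },
};
```

You would pass these options to HCCrawler.launch along with your other settings.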

2 years since the issue was opened, but if others in the future are looking to get the current URL, it's available in the result object of a customCrawl. Specifically result.options.url. Something like this should do the trick:

customCrawl: async (page, crawl) => {
    await page.setRequestInterception(true);
    page.on('request', request => request.continue());
    page.on('error', err => console.error(err));

    const result = await crawl();
    const currentUrl = result.options.url;
    // ... whatever logic you want
    return result;
}
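Building on that, here is a sketch of how result.options.url could serve the original use case (record zip/doc/pdf URLs without processing them). The extension regex, the appendToCsv callback, and the result.skipped flag are assumptions for illustration, not HCCrawler API:

```javascript
// Sketch of a customCrawl that records binary file URLs instead of
// processing them. `appendToCsv` is a hypothetical callback; `skipped`
// is a made-up flag for your own downstream handling.
const customCrawl = async (page, crawl) => {
  await page.setRequestInterception(true);
  page.on('request', request => request.continue());
  page.on('error', err => console.error(err));

  const result = await crawl();
  const currentUrl = result.options.url;
  if (/\.(zip|docx?|pdf)$/i.test(new URL(currentUrl).pathname)) {
    // appendToCsv(currentUrl); // record the URL for the CSV export
    result.skipped = true;      // mark it so later steps can ignore it
  }
  return result;
};
```

In a test or dry run you can exercise it with stub page and crawl arguments, since only result.options.url is inspected.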