yujiosaka/headless-chrome-crawler

How to broad crawl?

matheuschimelli opened this issue · 11 comments

What is the current behavior?
How to broad crawl? I'm trying to access a page, get all the links, and pass them to the queue, but I can't pass the retrieved links from a page to the queue. At the moment I have an array that should be filled with the links from the pages. This array should serve as the queue, but I can't insert data into the array.

If the current behavior is a bug, please provide the steps to reproduce

What is the expected behavior?
I expect to get all the links from a page and pass them to the queue, then loop over and visit those links too.

What is the motivation / use case for changing the behavior?
It can help more people do a broad crawl.

Please tell us about your environment:

  • Version: Latest
  • Platform / OS version: Linux Ubuntu 14.04.5 LTS trusty
  • Node.js version: 9.10.1

Set maxDepth: 0 in the queue options and the library will start crawling all the links it finds on the page.

Thanks. Is there any way I can pass these links to the queue to crawl them? I can't find this in the API docs.

It will add them automatically

So, I have a problem. When I set maxDepth: 0, the crawler just crawls one link and stops.

Code

const HCCrawler = require('headless-chrome-crawler');

(async() => {
  const crawler = await HCCrawler.launch({
    // Function to be evaluated in browsers
    evaluatePage: (() => ({
      link: $('a').attr('href'),
      title: $('title').text(),
    })),
    // Function to be called with evaluated results from browsers
    onSuccess: (result => {
      console.log(result.result.title)
    }),
  });
  
  await crawler.queue({
    url: 'http://books.toscrape.com',
    maxDepth:0
  });
  await crawler.onIdle(); // Resolved when no queue is left
  await crawler.close(); // Close the crawler
})();

Console

    All products | Books to Scrape - Sandbox

That's because jQuery's attr returns the href of only the first matched <a> element (in this case it's index.html).
Return the whole array, loop over it, and queue each URL.
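
For illustration, here is a minimal sketch of that approach (not from the thread). It assumes jQuery is injected into evaluatePage, as in the snippets above, and it resolves relative hrefs such as index.html against the page URL before queueing:

const HCCrawler = require('headless-chrome-crawler');
const { URL } = require('url');

(async () => {
  const crawler = await HCCrawler.launch({
    // Collect every href on the page, not just the first one
    evaluatePage: () => ({
      title: $('title').text(),
      links: $('a').map((i, el) => $(el).attr('href')).get(),
    }),
    onSuccess: async result => {
      console.log(result.result.title);
      for (const link of result.result.links) {
        if (!link) continue;
        // Resolve relative hrefs (e.g. index.html) against the current page URL
        const url = new URL(link, result.options.url).href;
        // Note: no deduplication here beyond whatever the crawler itself does
        await crawler.queue({ url, maxDepth: 0 });
      }
    },
  });

  await crawler.queue({ url: 'http://books.toscrape.com', maxDepth: 0 });
  await crawler.onIdle();
  await crawler.close();
})();

If the crawler already exposes the discovered links on the result object (result.links, which the later snippets in this thread rely on), that list can be queued directly instead of extracting hrefs by hand.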

OK, but it still doesn't work. The only way I can make it work is by using a for loop with an array. But the problem is the crawler's speed: when I use a custom array, it's too slow. When I set maxDepth: 100000000 it crawls the entire site without problems.

const HCCrawler = require('headless-chrome-crawler');

(async() => {
  var visitedURLs = [];
  const crawler = await HCCrawler.launch({
 
    evaluatePage: () => ({
      title: $("title").text()
    }),
    onSuccess: async result => {
      visitedURLs.push(result.options.url);
      console.log(visitedURLs.length, result.result.title.replace(/\n/g, ""), result.options.url);
     
      for (const link of result.links) {
        await crawler.queue({ url: link, maxDepth: 0 });
      }
      
    },
    // catch all errors
    onError: error => {
      console.log(error);
    }
  });


  await crawler.queue({
    url: "http://books.toscrape.com",
    maxDepth: 0, // when I set this to a big number like 100000 it crawls the entire site
    obeyRobotsTxt: false,
  });
  await crawler.onIdle(); 
  await crawler.close(); 
})();

Since you put it in those terms, maybe I'm confused. Try using Infinity; maybe that's what you needed this whole time and I misunderstood the crawler's programming.

I'm just saying that maybe it is a bug. Can you make maxDepth work? If so, could you post your code here? It would be useful.

Reading the code, I must admit I was wrong about maxDepth: 0.
You can see it for yourself here:

async _followLinks(urls, options, depth) {
  if (depth >= options.maxDepth) {
    this.emit(HCCrawler.Events.MaxDepthReached);
    return;
  }
  await Promise.all(map(urls, async url => {
    const _options = extend({}, options, { url });
    const skip = await this._skipRequest(_options);
    if (skip) return;
    await this._push(_options, depth + 1, options.url);
  }));
}

What you should do is set maxDepth: Infinity. With maxDepth: 0, the depth >= options.maxDepth check above is already true on the first page, so MaxDepthReached fires immediately and no links are followed. I'm too busy to test it right now, but it should work like that. In the coming days I'll run a test crawl and publish the code here so you can be sure it works.

If you try it and it works, please share it here!

No problem. This is my code, working. I hope it will be useful for everybody. Thanks to @BubuAnabelas. I think at the moment this is just a quick fix.

const HCCrawler = require('headless-chrome-crawler');

(async() => {
  var visitedURLs = [];
  const crawler = await HCCrawler.launch({
 
    evaluatePage: () => ({
      title: $("title").text()
    }),
    onSuccess: async result => {
      visitedURLs.push(result.options.url);
      console.log(visitedURLs.length, result.result.title.replace(/\n/g, ""), result.options.url);
      
    },
    // catch all errors
    onError: error => {
      console.log(error);
    }
  });


  await crawler.queue({
    url: "http://books.toscrape.com",
    maxDepth: Infinity, // solution here: maxDepth set to Infinity crawls the whole site
    obeyRobotsTxt: false,
  });
  await crawler.onIdle(); // Resolved when no queue is left
  await crawler.close(); // Close the crawler
})();

I'm happy it worked.
Please close the issue so we know there's at least a quick fix.