/site-walker

Primary LanguageJavaScript

SiteWalker.js

Simple web crawler with basic capability to crawl next page based on callback

How to install

$ npm install site-walker

Usage

var SiteWalker = require("site-walker")
var instance = new SiteWalker("http://someawesome.site.com",function(pageStr){
    //callback is fired when page is successfully crawled
    //pageStr contains crawled page, in string
    //do some scrapping here and there
    var nextUrl = "http://someawesome.site.com/page/2" //assume that page/2 is scrapped from current pageStr
    this.next(nextUrl)
})
instance
.then(function(){
    //fired when no nextUrl is supplied from callback
})
.catch(function(reason){
    //fired when error on retrieving page.
})
instance.crawl() //invoke crawling

You can call this.next(nextUrl) several times during callback. If so, the next url that will be crawled the first supplied nextUrl, and so on. For example :

    //supplied callback
    function(pageStr){
        //scrap scrap
        this.next(url1);
        this.next(url2);
        if(someConditionIsMet){
            this.next(url3)
        }
    }

the crawled page order will be :

url1 -> url2 -> url1 -> url2

If during callback, someConditionisMet evaluate to true, the order of execution will be :

url1 -> url2 -> url3 -> url1 -> url2

Notes

  • Currently, if during crawling a URL is failed to be crawled, SiteWalker will break the execution and throw reject
  • No stop() method is available. So, if you keep supplying nextUrl on callback, SiteWalker will run forever (theoretically)

GitHub

https://github.com/aerios/site-walker