A library for crawling a website with a depth limit.
Given a root URL, it returns a site map showing which static assets each page depends on and the links between pages.
The crawler is limited to a single domain: when crawling example.com it visits every page within that domain but does not follow external links.
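The same-domain restriction can be sketched roughly as follows; sameDomain and the host comparison here are illustrative assumptions, not the crawler's actual code, which may normalize hosts differently.

package main

import (
	"fmt"
	"net/url"
)

// sameDomain reports whether link belongs to the same host as root.
// Hypothetical sketch of the domain check described above.
func sameDomain(root, link string) bool {
	r, err := url.Parse(root)
	if err != nil {
		return false
	}
	l, err := url.Parse(link)
	if err != nil {
		return false
	}
	return r.Host == l.Host
}

func main() {
	fmt.Println(sameDomain("http://example.com/", "http://example.com/about")) // true
	fmt.Println(sameDomain("http://example.com/", "http://other.com/"))        // false
}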
Installation requires Go 1.1 or later, needed by goquery.
$ go get github.com/chisler/webcrawler
The -depth flag has a default value; -startUrl does not and must be supplied, so running the tool without arguments prints the usage (a minimal flag-parsing sketch follows the listing):
$ webcrawler
Usage: main -startUrl=http://example.com/ -depth=3
  -depth int
        Depth of crawling. (default 4)
  -startUrl string
        Root URL of website to crawl.
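A usage listing in this shape falls out of Go's standard flag package. The following is a hypothetical re-creation of the flag setup, not the tool's actual main function.

package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// -depth has a default; -startUrl must be provided by the user.
	depth := flag.Int("depth", 4, "Depth of crawling.")
	startUrl := flag.String("startUrl", "", "Root URL of website to crawl.")
	flag.Parse()

	if *startUrl == "" {
		fmt.Println("Usage: main -startUrl=http://example.com/ -depth=3")
		flag.PrintDefaults()
		os.Exit(1)
	}
	fmt.Printf("crawling %s to depth %d\n", *startUrl, *depth)
}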
Example run
$ webcrawler -startUrl=http://monzo.com -depth=2
...
$ ls
result.txt
Note that $GOPATH must be set.
What's in the box
$ head result.txt
_________MAP__________
Extracted 14 pages
Execution started at 2017-03-27 15:50:22.692179718 +0300 MSK
Execution took 3.431152137s
Node: http://monzo.com/-play-store-redirect
Urls: [urllist]
Assets: [assetlist]
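One plausible shape for each map entry is sketched below; the Node type, its field names, and the sample asset URL are assumptions for illustration only, not the crawl package's actual API.

package main

import "fmt"

// Node is a hypothetical representation of one site-map entry:
// a page, the same-domain links it contains, and its static assets.
type Node struct {
	URL    string
	Urls   []string // links to other pages on the same domain
	Assets []string // static assets (images, scripts, stylesheets)
}

func main() {
	n := Node{
		URL:    "http://monzo.com/-play-store-redirect",
		Urls:   []string{"http://monzo.com/"},
		Assets: []string{"http://monzo.com/static/app.js"}, // invented example asset
	}
	fmt.Printf("Node: %s\nUrls: %v\nAssets: %v\n", n.URL, n.Urls, n.Assets)
}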
Run tests
$ go test github.com/chisler/webcrawler/crawl
ok github.com/chisler/webcrawler/crawl 0.003s
$ go test github.com/chisler/webcrawler/fetch
ok github.com/chisler/webcrawler/fetch 0.008s