Given a website or a list of websites, fetch each web page, and then fetch all the pages linked
to from that page.
And so on? Sometimes. See below.
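The heart of that idea is just: fetch a page, pull out its links, repeat. Below is a minimal sketch of the single-page step in Go. It is not the actual creep code; the function names and the crude regexp link extraction are illustrative assumptions only.

// Illustrative sketch, not the creep package: fetch one page and list its links.
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
)

// hrefPattern is a crude way to pull absolute href values out of a page.
// A real crawler would use a proper HTML parser instead of a regexp.
var hrefPattern = regexp.MustCompile(`href="(https?://[^"]+)"`)

// fetchLinks downloads one page and returns the absolute links found in it.
func fetchLinks(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	for _, m := range hrefPattern.FindAllStringSubmatch(string(body), -1) {
		links = append(links, m[1])
	}
	return links, nil
}

func main() {
	links, err := fetchLinks("https://www.iana.org/")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
}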
To install:
$ go get github.com/RickyS/Crawl
$ go get github.com/RickyS/Creep
You'll need both packages; they depend on each other. The main program is crawl. The working package is creep. Note the capital Cs in the names passed to 'go get'.
To run:
$ go run crawl.go
Or, on Linux:
$ go install -v -x && crawl
Instead of command line arguments, the crawl program reads a json parameter file.
The default is a small, simple file, iana.json, which is among the included files.
crawl.go contains the line:
jobData := creep.LoadJobData("iana.json")
which specifies that the parameter file iana.json will be used.
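For reference, here is a sketch of what a LoadJobData routine and the json it reads could look like. The struct fields (startUrls, maxPages, maxDepth) are assumptions made for illustration, not the actual contents of iana.json or of the creep package.

// Illustrative sketch only; not the real creep package.
package creep

import (
	"encoding/json"
	"log"
	"os"
)

// JobData holds the crawl parameters read from the json file.
// These field names are illustrative guesses.
type JobData struct {
	StartURLs []string `json:"startUrls"` // pages to begin crawling from
	MaxPages  int      `json:"maxPages"`  // artificial cap on pages fetched
	MaxDepth  int      `json:"maxDepth"`  // how many link-hops to follow
}

// LoadJobData reads and decodes the named json parameter file.
func LoadJobData(name string) JobData {
	f, err := os.Open(name)
	if err != nil {
		log.Fatalf("cannot open %s: %v", name, err)
	}
	defer f.Close()

	var jd JobData
	if err := json.NewDecoder(f).Decode(&jd); err != nil {
		log.Fatalf("cannot parse %s: %v", name, err)
	}
	return jd
}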
Todo: The name of the json file should be a command line argument.
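One possible way to do that, sketched with the standard flag package (not code from this repository):

// Sketch of taking the json file name from the command line.
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Default to the bundled iana.json so the current behavior is preserved.
	jobFile := flag.String("job", "iana.json", "name of the json parameter file")
	flag.Parse()

	// In crawl.go this value would replace the hard-coded name, e.g.:
	//   jobData := creep.LoadJobData(*jobFile)
	fmt.Println("would load parameters from", *jobFile)
}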
Strictly speaking, crawling the entire web isn't possible: the number of web pages is effectively infinite. So, after numerous tests that filled up my machine, I have artificially capped the number of web pages crawled.
There are parameters in the json file that adjust these limits. TBD.
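As a rough illustration of how such a cap works, here is a sketch of a breadth-first crawl loop that stops after a fixed number of pages. The queue, the seen-set, and the maxPages value are assumptions about the general technique, not creep's actual internals.

// Illustrative sketch of a capped breadth-first crawl; not creep's code.
package main

import "fmt"

func crawl(start []string, maxPages int) {
	queue := append([]string(nil), start...)
	seen := make(map[string]bool)
	fetched := 0

	for len(queue) > 0 && fetched < maxPages {
		url := queue[0]
		queue = queue[1:]
		if seen[url] {
			continue // never fetch the same page twice
		}
		seen[url] = true
		fetched++

		// Fetch url and append its links to queue here (omitted).
		fmt.Println("fetching", url)
	}
}

func main() {
	crawl([]string{"https://www.iana.org/"}, 50)
}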