go get -u github.com/dave/scrapy
scrapy [url]
The scrapy command gets the page at url, parses it for links, and recursively gets every linked page on the same domain.
Stats are printed while the crawl runs, and a list of the scraped URLs is printed when it finishes. You can end the job early with Ctrl+C.
Several command line flags are available:
-length int
    Length of the queue (default 1000)
-timeout int
    Request timeout in ms (default 10000)
-url string
    The start page (default "https://monzo.com")
-workers int
    Number of concurrent workers (default 5)
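For example, to crawl a site with 10 workers and a 5 second timeout (example.com is a placeholder; the flags combine as Go's standard flag package parses them):

scrapy -workers 10 -timeout 5000 -url https://example.com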
This scraper can also be used as a library. See the scraper package.
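The scraper package's godoc documents the real API. As a rough sketch of the pattern the command implements (a bounded URL queue drained by a fixed pool of workers, restricted to the start page's domain), here is a standalone program using only the standard library. Every name in it is illustrative rather than the library's actual types, and the regex link extraction is a stand-in for proper HTML parsing:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"regexp"
	"sync"
	"time"
)

// Naive absolute-link extraction; the real tool parses HTML properly.
var hrefRe = regexp.MustCompile(`href="(https?://[^"#]+)"`)

func fetch(client *http.Client, u string) (string, error) {
	resp, err := client.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	start := "https://example.com" // placeholder start page
	base, err := url.Parse(start)
	if err != nil {
		panic(err)
	}

	client := &http.Client{Timeout: 10 * time.Second} // cf. -timeout
	queue := make(chan string, 1000)                  // cf. -length

	var wg sync.WaitGroup // counts URLs queued but not yet processed
	var mu sync.Mutex
	seen := map[string]bool{}

	// enqueue adds each URL at most once, dropping it if the queue is full
	// so that workers enqueuing new links can never block.
	enqueue := func(u string) {
		mu.Lock()
		defer mu.Unlock()
		if seen[u] {
			return
		}
		seen[u] = true
		wg.Add(1)
		select {
		case queue <- u:
		default:
			wg.Done() // queue full: drop the URL
		}
	}

	for i := 0; i < 5; i++ { // cf. -workers
		go func() {
			for u := range queue {
				if body, err := fetch(client, u); err == nil {
					for _, m := range hrefRe.FindAllStringSubmatch(body, -1) {
						// Only follow links on the start page's domain.
						if link, err := url.Parse(m[1]); err == nil && link.Host == base.Host {
							enqueue(m[1])
						}
					}
				}
				fmt.Println(u)
				wg.Done()
			}
		}()
	}

	enqueue(start)
	wg.Wait()    // every queued URL has been processed
	close(queue) // let the workers exit
}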
See here for design notes and brainstorming.
Example output (the latency buckets are milliseconds):

Summary
-------
Queued 46
In progress 5 https://monzo.com/blog/2018/08/30/manage-your-bills
Success 22
Errors 0
Latency
-------
0 - 100 ***
100 - 200
200 - 300
300 - 400 **************************
400 - 500 ******************************
500 - 600 ***************
600 - 700 ***
700 - 800 ***
800 - 900
900 - 1000
1000 - 1100
1100 - 1200
1200 - 1300
1300 - 1400
1400 - 1500
1500 - 1600
1600 - 1700
1700 - 1800
1800 - 1900
1900 - 2000
2000+
URLs
----
https://monzo.com
https://monzo.com/-play-store-redirect
https://monzo.com/about
https://monzo.com/blog
https://monzo.com/blog/2018/07/02/publishing-our-2018-annual-report
https://monzo.com/blog/2018/07/10/making-quarterly-goals-public
https://monzo.com/blog/2018/07/25/monzo-reliability-report
https://monzo.com/blog/how-money-works
https://monzo.com/blog/latest
...