go get -u github.com/dave/scrapy
scrapy [url]
The scrapy command gets the page at url, parses it for links, and recursively gets every linked page on the same domain.
Stats are printed while the crawl runs, and a list of the scraped URLs is printed when it finishes. You can end the job early with Ctrl+C.
Several command line flags are available:
-length int
    Length of the queue (default 1000)
-timeout int
    Request timeout in ms (default 10000)
-url string
    The start page (default "https://monzo.com")
-workers int
    Number of concurrent workers (default 5)
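For example, to crawl a site with 10 workers and a 5 second timeout (example.com is a placeholder; the flags combine as Go's standard flag package parses them):

scrapy -workers 10 -timeout 5000 -url https://example.com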
This scraper can also be used as a library. See the scraper package.
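The scraper package's godoc documents the real API. As a rough sketch of the pattern the command implements (a bounded URL queue drained by a fixed pool of workers, restricted to the start page's domain), here is a standalone program using only the standard library. Every name in it is illustrative rather than the library's actual types, and the regex link extraction is a stand-in for proper HTML parsing:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"regexp"
	"sync"
	"time"
)

// Naive absolute-link extraction; the real tool parses HTML properly.
var hrefRe = regexp.MustCompile(`href="(https?://[^"#]+)"`)

func fetch(client *http.Client, u string) (string, error) {
	resp, err := client.Get(u)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	start := "https://example.com" // placeholder start page
	base, err := url.Parse(start)
	if err != nil {
		panic(err)
	}

	client := &http.Client{Timeout: 10 * time.Second} // cf. -timeout
	queue := make(chan string, 1000)                  // cf. -length

	var wg sync.WaitGroup // counts URLs queued but not yet processed
	var mu sync.Mutex
	seen := map[string]bool{}

	// enqueue adds each URL at most once, dropping it if the queue is full
	// so that workers enqueuing new links can never block.
	enqueue := func(u string) {
		mu.Lock()
		defer mu.Unlock()
		if seen[u] {
			return
		}
		seen[u] = true
		wg.Add(1)
		select {
		case queue <- u:
		default:
			wg.Done() // queue full: drop the URL
		}
	}

	for i := 0; i < 5; i++ { // cf. -workers
		go func() {
			for u := range queue {
				if body, err := fetch(client, u); err == nil {
					for _, m := range hrefRe.FindAllStringSubmatch(body, -1) {
						// Only follow links on the start page's domain.
						if link, err := url.Parse(m[1]); err == nil && link.Host == base.Host {
							enqueue(m[1])
						}
					}
				}
				fmt.Println(u)
				wg.Done()
			}
		}()
	}

	enqueue(start)
	wg.Wait()    // every queued URL has been processed
	close(queue) // let the workers exit
}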
See here for design notes and brainstorming.
Example output (the latency buckets are milliseconds):

Summary
-------
Queued 46
In progress 5 https://monzo.com/blog/2018/08/30/manage-your-bills
Success 22
Errors 0
Latency
-------
0 - 100 ***
100 - 200
200 - 300
300 - 400 **************************
400 - 500 ******************************
500 - 600 ***************
600 - 700 ***
700 - 800 ***
800 - 900
900 - 1000
1000 - 1100
1100 - 1200
1200 - 1300
1300 - 1400
1400 - 1500
1500 - 1600
1600 - 1700
1700 - 1800
1800 - 1900
1900 - 2000
2000+
URLs
----
https://monzo.com
https://monzo.com/-play-store-redirect
https://monzo.com/about
https://monzo.com/blog
https://monzo.com/blog/2018/07/02/publishing-our-2018-annual-report
https://monzo.com/blog/2018/07/10/making-quarterly-goals-public
https://monzo.com/blog/2018/07/25/monzo-reliability-report
https://monzo.com/blog/how-money-works
https://monzo.com/blog/latest
...