Dean's Cool Web Crawler
A CLI tool for grabbing anchors from webpages.
Installation
go run cmd/main.go -uri https://www.google.com
Syntax
This program takes one argument: a fuly-qualified URL for the webpage you wish to scrape.
go-crawler -uri <URL>
This will then output a file to your current working directory.
Anchors printed to <wd>/anchors-<timestamp>.txt! Thank you for using my cool tool!
You cna add the flag -outputtoconsole
if you would prefer to have the URLs dumped into your current terminal session:
go-crawler -uri <URL> -outputtoconsole
Tests & Benchmarks
Benchmark specification:
goos: linux
goarch: amd64
pkg: github.com/deanfoley/go-web-crawler/internal
cpu: Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz
Benchmark command:
go test --bench=. -benchmem -benchtime=10s -count=5 -run=^#
NOTE: this actually didn't work properly on some of the functions and resulted in horrible race conditions. Benchmarking them individually (such as with VSCode's Go extension) works fine
/internal
PageGrabber
Test | Average Cycles | Average ns/op | Bytes/op | Allocs/op |
---|---|---|---|---|
GrabWebpage | 58,840 | 204,057 | 91,433 | 77 |
PageParser
Test | Average Cycles | Average ns/op | Bytes/op | Allocs/op |
---|---|---|---|---|
ExtractAnchors | 134,922 | 12,168 | 6,552 | 25 |
FormatAnchors | 626,671 | 2718 | 593 | 3 |
UrlParser
Test | Average Cycles | Average ns/op | Bytes/op | Allocs/op |
---|---|---|---|---|
ValidURL | 725,694 | 1,834 | 144 | 1 |
InvaldURL | 1,000,000 | 1,128 | 208 | 3 |
pprof
This project supports pprof!
Pass in a -cpuprofile and/or -memprofile flag with a desired output to output a prof file for either.
go run main.go -uri https://www.vortex.com -cpuprofile cpu.prof -memprofile mem.prof