Implement a recursive, mirroring web crawler. The crawler should be a command-line tool that accepts a starting URL and a destination directory. It downloads the page at the starting URL, saves it in the destination directory, and then recursively follows any valid links on that page.
A valid link is the value of an href attribute in an <a> tag that resolves to a URL that is a child of the initial URL.
For example, given the initial URL https://start.url/abc, URLs that resolve to https://start.url/abc/foo and https://start.url/abc/foo/bar are valid, but URLs that resolve to https://another.domain/ or to https://start.url/baz are not, and should be skipped.
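To make the rule concrete, here is a minimal sketch of such a check using net/url; `isValidLink` is a hypothetical helper name, not necessarily what the implementation uses:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isValidLink reports whether href, resolved against the page's base URL,
// points at the start URL itself or a child of it (same host, nested path).
func isValidLink(start, base *url.URL, href string) bool {
	ref, err := url.Parse(href)
	if err != nil {
		return false
	}
	resolved := base.ResolveReference(ref)
	if resolved.Host != start.Host {
		return false
	}
	return resolved.Path == start.Path ||
		strings.HasPrefix(resolved.Path, strings.TrimSuffix(start.Path, "/")+"/")
}

func main() {
	start, _ := url.Parse("https://start.url/abc")
	for _, href := range []string{
		"https://start.url/abc/foo", // valid: child of the start URL
		"/abc/foo/bar",              // valid: relative href resolving under /abc
		"https://another.domain/",   // invalid: different host
		"/baz",                      // invalid: outside the start path
	} {
		fmt.Println(href, isValidLink(start, start, href))
	}
}
```

Resolving each href against the page's base URL also handles relative links such as `/abc/foo/bar` for free.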
Additionally, the crawler should:
- Correctly handle being interrupted by Ctrl-C (see the sketch after this list)
- Perform work in parallel where reasonable
- Support resuming by checking the destination directory for already-downloaded pages and skipping any download or processing that is not necessary
- Provide “happy-path” test coverage
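One common way to satisfy the Ctrl-C and parallelism requirements in Go is to combine `signal.NotifyContext` with a small worker pool. The sketch below is illustrative only and makes no claims about the actual implementation; the pool size and channel wiring are assumptions:

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
)

func main() {
	// Cancel the context on Ctrl-C (SIGINT) or SIGTERM so in-flight work
	// can stop cleanly and no new work is scheduled.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	urls := make(chan string)
	var wg sync.WaitGroup

	// A small fixed pool of workers processes pages in parallel.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case u, ok := <-urls:
					if !ok {
						return
					}
					fmt.Println("would download", u) // placeholder for fetch + save
				}
			}
		}()
	}

	for _, u := range []string{"https://start.url/abc", "https://start.url/abc/foo"} {
		select {
		case <-ctx.Done():
		case urls <- u:
		}
	}
	close(urls)
	wg.Wait()
}
```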
Some tips:
- If you’re not familiar with this kind of software, see `wget --mirror` for very similar functionality.
- Document missing features and any other changes you would make if you had more time for the assignment implementation.
Install crawler using the command below:

```sh
go install
```
To view available options and usage, run:

```sh
crawler --help
```
To crawl a website, run:

```sh
crawler -s https://example.com -d downloads
```
The project uses GitHub Actions to run tests and benchmarks by default. To run them locally, use the commands below.
Tests:

```sh
go test -cover -race ./... -v
```
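For reference, a happy-path test along the lines requested can be built on `net/http/httptest`. This sketch (which would live in a `_test.go` file) only illustrates the shape; the handler paths and assertions are assumptions, not the project's actual tests:

```go
package main

import (
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
)

// TestCrawlHappyPath serves two linked pages from an in-memory server and
// checks the root page can be fetched. A real test would invoke the crawler
// against srv.URL and assert on the files written to the destination dir.
func TestCrawlHappyPath(t *testing.T) {
	mux := http.NewServeMux()
	mux.HandleFunc("/abc", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, `<a href="/abc/foo">foo</a>`)
	})
	mux.HandleFunc("/abc/foo", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "leaf page")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/abc")
	if err != nil {
		t.Fatalf("fetch root: %v", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("got status %d, want 200", resp.StatusCode)
	}
}
```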
Benchmarks:

```sh
go test -bench=. ./...
```
Benchmark results on Apple M1 Pro:

```
goos: darwin
goarch: arm64
pkg: github.com/jwambugu/crawler/cmd/crawler
BenchmarkCrawler_Crawl-8                     25808    44546 ns/op
BenchmarkCrawler_CrawlWithoutConcurrency-8   34762    34582 ns/op
PASS
ok    github.com/jwambugu/crawler/cmd/crawler    3.955s
```
Benchmark results on Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz:

```
goos: linux
goarch: amd64
pkg: github.com/jwambugu/crawler/cmd/crawler
cpu: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
BenchmarkCrawler_Crawl-2                     15939    75629 ns/op
BenchmarkCrawler_CrawlWithoutConcurrency-2   23270    50507 ns/op
PASS
ok    github.com/jwambugu/crawler/cmd/crawler    3.680s
```
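For context, `go test -bench=.` picks up any function of the following shape; the body here is a placeholder, not the project's actual benchmark:

```go
package main

import "testing"

// BenchmarkCrawl shows the shape of a benchmark discovered by -bench=.;
// the loop body would crawl a local fixture server, not a live site.
func BenchmarkCrawl(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// crawl a fixture page here
	}
}
```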
Missing features and changes I would make with more time:
- The crawler should not exit if a link returns a 404. It should skip the missing link's URL and continue crawling from the previous page.
- Keep track of the last crawled link and resume from it instead of starting afresh.
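A minimal sketch of how the 404-skipping and existence-based resume behaviour described above might look; `savePage` and the host/path on-disk layout are assumptions, not taken from the repository:

```go
package main

import (
	"context"
	"io"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
)

// savePage downloads u into destDir unless a copy already exists (resume),
// and silently skips 404s so one dead link does not stop the crawl.
func savePage(ctx context.Context, u, destDir string) error {
	parsed, err := url.Parse(u)
	if err != nil {
		return err
	}
	// Assumed layout: <destDir>/<host>/<path>.
	path := filepath.Join(destDir, parsed.Host, filepath.FromSlash(parsed.Path))
	if _, err := os.Stat(path); err == nil {
		return nil // already on disk from a previous run: resume by skipping
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNotFound {
		return nil // skip the missing link and let the caller continue
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}
```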