creepycrawler is a simple, concurrent web scraper and mapper, written in Go.
It is not special or significant in any way.
It's also the second thing I've ever written in Go, so a little slack would be appreciated! 🙏
Things which have been considered but not implemented, as they're arguably out of scope for this assignment:
- robots.txt testing (don't scrape things which we aren't allowed to)
- Display of a proper "web" of relations
  - This is stored and can be easily calculated, but without writing a load of boilerplate code for making fancy webs, the tree will have to do (it prints a note next to back references)
- Better logging / the ability to turn logging off
  - Right now it's super verbose so you can see what's happening, but I would probably switch to using something like `github.com/sirupsen/logrus` to add definable log levels.
Package versions are pinned using `dep`.
I've written the stack like this:
- Command comes in through `cmd/creepycrawler/main.go`
- `cmd` calls `crawl.WalkTarget` with the target URL
- `crawl.WalkTarget` pulls the target URL into an `HtmlPage` struct, parses it, and fills it out
  - Any links on the page are checked against a central truth store (to avoid branch duplication)
  - This truth store is a simple `map` accessible only via a read and a write channel, controlled by a single worker goroutine (a sketch of this pattern follows the list below)
- `crawl.WalkTarget` then calls `HtmlPage.recurse()`
  - For every link in the `HtmlPage`, `recurse()` starts a new goroutine
  - This goroutine runs the page scraper, then calls a `recurse()` of its own; the idea is that each subpage gets its own goroutine worker, then each subpage of each subpage gets its own worker, and so on
  - To ensure workers don't try to start recursing a page already belonging to another worker, a `mutex` is used on `HtmlPage`
  - Once each worker has finished, it sends a Completed boolean message on a channel back to its owning worker, then exits
  - A `recurse()` call only returns once all of its launched subworkers have returned Completed messages (this fan-out/fan-in pattern is also sketched after the list)
- `crawl.WalkTarget` returns a completed `HtmlPage` containing a web of pointers to other `HtmlPage`s
  - Loops are possible if you just keep recursing down `HtmlPage.LinksTo`
- `cmd` then calls the `displayTree` package in order to get a human-readable tree, which is calculated with the help of `github.com/disiqueira/gotree`
  - Some deduplication / recursion checking goes on inside `displayTree` to avoid infinite loops and to make it clear which sublinks are links to other, already-found pages (also sketched after the list)
- `cmd` prints the result of `displayTree` to the console
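
To make the truth-store pattern concrete, here's a minimal, self-contained sketch of it. The names (`truthStore`, `Seen`, `Add`) are made up for this example and don't match the actual code; the point is just that the `map` is owned by one goroutine and everything else talks to it over channels:

```go
package main

import "fmt"

// readReq asks the store whether a URL is already known.
type readReq struct {
	url   string
	found chan bool
}

// truthStore is a sketch of a map guarded by channels: the map itself is only
// ever touched by the single goroutine started in newTruthStore.
type truthStore struct {
	reads  chan readReq
	writes chan string
}

func newTruthStore() *truthStore {
	s := &truthStore{
		reads:  make(chan readReq),
		writes: make(chan string),
	}
	go func() {
		seen := make(map[string]bool)
		for {
			select {
			case req := <-s.reads:
				req.found <- seen[req.url]
			case url := <-s.writes:
				seen[url] = true
			}
		}
	}()
	return s
}

// Seen reports whether a URL has already been recorded.
func (s *truthStore) Seen(url string) bool {
	found := make(chan bool)
	s.reads <- readReq{url: url, found: found}
	return <-found
}

// Add records a URL so other branches won't duplicate the work.
func (s *truthStore) Add(url string) { s.writes <- url }

func main() {
	store := newTruthStore()
	store.Add("https://example.com/")
	fmt.Println(store.Seen("https://example.com/"))      // true
	fmt.Println(store.Seen("https://example.com/about")) // false
}
```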
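
Similarly, a stripped-down sketch of the `recurse()` fan-out/fan-in idea. The `Links` and `claimed` fields and the scraping step are placeholders rather than the real `HtmlPage` API:

```go
package main

import (
	"fmt"
	"sync"
)

// HtmlPage here is a placeholder with only the fields this sketch needs.
type HtmlPage struct {
	URL     string
	Links   []*HtmlPage // stands in for HtmlPage.LinksTo
	mu      sync.Mutex
	claimed bool // set once a worker takes ownership of the page
}

// recurse launches one goroutine per link, then only returns once every
// launched subworker has sent a Completed message back on the done channel.
func (p *HtmlPage) recurse() {
	done := make(chan bool)
	launched := 0

	for _, link := range p.Links {
		link.mu.Lock()
		if link.claimed { // already owned by another worker: skip it
			link.mu.Unlock()
			continue
		}
		link.claimed = true
		link.mu.Unlock()

		launched++
		go func(child *HtmlPage) {
			// The real worker scrapes the page here before recursing.
			fmt.Println("scraping", child.URL)
			child.recurse()
			done <- true // report Completed back to the owning worker
		}(link)
	}

	for i := 0; i < launched; i++ {
		<-done
	}
}

func main() {
	about := &HtmlPage{URL: "https://example.com/about"}
	root := &HtmlPage{URL: "https://example.com/", Links: []*HtmlPage{about}, claimed: true}
	about.Links = []*HtmlPage{root} // a loop back to the root is harmless

	root.recurse()
}
```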
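
And a rough sketch of the kind of back-reference check that goes on inside `displayTree` when building the tree with `gotree`; `buildTree` and the struct fields here are illustrative only:

```go
package main

import (
	"fmt"

	"github.com/disiqueira/gotree"
)

// HtmlPage is again a placeholder with just enough fields for the example.
type HtmlPage struct {
	URL     string
	LinksTo []*HtmlPage
}

// buildTree adds each page to the tree once; a page that has already been
// added elsewhere gets a back-reference note instead of being recursed into,
// which is what stops the infinite loops.
func buildTree(page *HtmlPage, node gotree.Tree, seen map[string]bool) {
	for _, link := range page.LinksTo {
		if seen[link.URL] {
			node.Add(link.URL + " (back reference)")
			continue
		}
		seen[link.URL] = true
		buildTree(link, node.Add(link.URL), seen)
	}
}

func main() {
	about := &HtmlPage{URL: "https://example.com/about"}
	root := &HtmlPage{URL: "https://example.com/", LinksTo: []*HtmlPage{about}}
	about.LinksTo = []*HtmlPage{root} // loop back to the root

	tree := gotree.New(root.URL)
	buildTree(root, tree, map[string]bool{root.URL: true})
	fmt.Println(tree.Print())
}
```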
A makefile is included to make this as simple as possible. To run:
- Move everything here to `$GOPATH/src/github.com/luaduck/creepycrawler`
- Install dependencies: `make dep`
  - You need `dep` installed to do this. If it's not available for whatever reason: `go get -u github.com/golang/dep/cmd/dep`
- (optionally) Run tests: `make test`
- Build the app: `make build`
  - This drops a binary at `./creepycrawler`
- Run the crawler: `./creepycrawler`
  - The syntax is as follows: `./creepycrawler [OPTIONS] domain`
  - The option `-show-backrefs` will show entries in the tree which refer to a page whose tree has already been displayed, such as those further back down the stack (e.g. a subpage referring back to root); turning this off will purely show a list of all domains, along with the depth at which they were first discovered
  - `domain` should be defined in full RFC 1738 format. If a scheme isn't provided, it will default to `https` (a rough sketch of this defaulting follows below).
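
For illustration only, the flag handling and scheme defaulting could look roughly like this (the real `main.go` may well differ):

```go
package main

import (
	"flag"
	"fmt"
	"net/url"
	"os"
	"strings"
)

func main() {
	showBackrefs := flag.Bool("show-backrefs", false, "show entries that refer back to already-displayed pages")
	flag.Parse()

	if flag.NArg() != 1 {
		fmt.Fprintln(os.Stderr, "usage: creepycrawler [OPTIONS] domain")
		os.Exit(1)
	}

	raw := flag.Arg(0)
	if !strings.Contains(raw, "://") {
		raw = "https://" + raw // default to https when no scheme is given
	}

	target, err := url.Parse(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid domain:", err)
		os.Exit(1)
	}

	fmt.Println("crawling", target.String(), "show-backrefs:", *showBackrefs)
}
```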
creepycrawler is licensed under the Unlicense. Please see LICENSE for more information.