/sitemap

Simple little utility to crawl a website

Primary language: Go. License: MIT.

Sitemap Generator

Summary

Given a starting URL, this utility crawls a website and reports all the static assets and links it finds on that site.

Design Goals

  • Crawl an entire site and report on its structure
  • Flexible output formats (JSON, tab, digraph)
  • Customizable performance characteristics (worker count, rate limit, timeout)

Design Decisions

  • The utility stays within the same domain as the starting URL.
  • When the utility finds a URL it has already seen, it does not traverse that URL's links again, but it still reports the link where it was found.
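The two decisions above can be illustrated with a minimal sketch. This is not the utility's actual code: `inScope`, `crawl`, and `fetchLinks` are hypothetical names, "same domain" is assumed to mean an exact hostname match, and the site is an in-memory map so the example runs without network access.

```go
package main

import (
	"fmt"
	"net/url"
)

// inScope reports whether link is on the same host as start.
// Assumption: "same domain" means an exact hostname match.
func inScope(start, link string) bool {
	s, err1 := url.Parse(start)
	l, err2 := url.Parse(link)
	if err1 != nil || err2 != nil {
		return false
	}
	return s.Hostname() == l.Hostname()
}

// crawl walks the site breadth-first from start. fetchLinks is a
// hypothetical stand-in for downloading a page and extracting its links.
// The result maps each crawled page to every link found on it.
func crawl(start string, fetchLinks func(string) []string) map[string][]string {
	report := map[string][]string{}
	seen := map[string]bool{start: true}
	queue := []string{start}
	for len(queue) > 0 {
		page := queue[0]
		queue = queue[1:]
		links := fetchLinks(page)
		report[page] = links // every link is reported, duplicates included
		for _, link := range links {
			if seen[link] || !inScope(start, link) {
				continue // already seen or off-site: reported, not traversed
			}
			seen[link] = true
			queue = append(queue, link)
		}
	}
	return report
}

func main() {
	// A tiny fake site used instead of real HTTP requests.
	site := map[string][]string{
		"http://example.com/":  {"http://example.com/a", "http://other.org/"},
		"http://example.com/a": {"http://example.com/", "http://example.com/b"},
		"http://example.com/b": {"http://example.com/a"},
		"http://other.org/":    {"http://other.org/x"},
	}
	report := crawl("http://example.com/", func(u string) []string { return site[u] })
	for page, links := range report {
		fmt.Println(page, "->", links)
	}
}
```

In this sketch the link to `http://other.org/` appears in the report for the start page, but that page itself is never fetched.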

Features

  • Ability to save results to a file
  • Set number of worker threads/goroutines to crawl a site
  • Set rate limiter, if desired
  • Set inactivity timeout
  • Read in saved results and redisplay in different formats
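To show how a worker count, a rate limiter, and an inactivity timeout can fit together in Go, here is a rough sketch. It is an illustration under assumptions, not the utility's implementation: `run` and `visit` are hypothetical names, the URL list is fixed up front, and a rate of zero is taken to mean "no limiting".

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// run visits urls using `workers` goroutines. At most one URL is handed out
// per `rate` interval (rate <= 0 disables limiting). Collection stops early
// if no result arrives within `idle`. visit is a hypothetical stand-in for
// fetching one page.
func run(urls []string, workers int, rate, idle time.Duration, visit func(string) string) []string {
	jobs := make(chan string)
	// Buffered so workers never block on send, even after a timeout.
	results := make(chan string, len(urls))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- visit(u)
			}
		}()
	}

	// Feeder: hands out jobs, waiting for a tick first when rate limiting is on.
	go func() {
		defer close(jobs)
		var tick <-chan time.Time
		if rate > 0 {
			t := time.NewTicker(rate)
			defer t.Stop()
			tick = t.C
		}
		for _, u := range urls {
			if tick != nil {
				<-tick
			}
			jobs <- u
		}
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	var got []string
	for {
		select {
		case r, ok := <-results:
			if !ok {
				return got // all work done
			}
			got = append(got, r)
		case <-time.After(idle): // a fresh timer each iteration: inactivity timeout
			return got
		}
	}
}

func main() {
	urls := []string{"/", "/a", "/b", "/c"}
	visit := func(u string) string { return "visited " + u }
	got := run(urls, 2, 10*time.Millisecond, time.Second, visit)
	fmt.Println(len(got), "pages visited")
}
```

A real crawler discovers new URLs while it runs, so its job queue grows dynamically; the inactivity timeout is then one way to decide that the crawl has finished.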

How to get it

(1) You have Docker installed

docker run mkboudreau/sitemap ....

(2) You have Go installed

go get github.com/mkboudreau/sitemap 
make install

Example Usage

Crawl site with sensible defaults

sitemap www.microsoft.com

Crawl site with 50 workers

sitemap -w 50 www.microsoft.com

Crawl site with rate limiting turned off

sitemap -r 0s www.microsoft.com

Crawl site and output JSON

sitemap -f json www.microsoft.com

Crawl site and output tabular format (default)

sitemap -f tab www.microsoft.com

Crawl site and output digraph (dot)

sitemap -f digraph www.microsoft.com

Crawl site and save results to file

sitemap -o saved.json www.microsoft.com

Use saved results and output as a digraph

sitemap -i saved.json -f digraph
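As a rough idea of what redisplaying saved results as a digraph involves, the sketch below decodes a JSON document and prints Graphviz DOT. The JSON layout (a map from page URL to the links found on it) and the names `toDot` and `saved` are assumptions made for illustration; the format the utility actually writes to saved.json may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// toDot renders a page -> links map as a Graphviz digraph.
func toDot(links map[string][]string) string {
	pages := make([]string, 0, len(links))
	for p := range links {
		pages = append(pages, p)
	}
	sort.Strings(pages) // stable output regardless of map iteration order

	var b strings.Builder
	b.WriteString("digraph sitemap {\n")
	for _, p := range pages {
		for _, l := range links[p] {
			// Go's %q quoting is adequate for plain URLs in DOT.
			fmt.Fprintf(&b, "  %q -> %q;\n", p, l)
		}
	}
	b.WriteString("}\n")
	return b.String()
}

func main() {
	// Hypothetical saved results; the real saved.json schema may differ.
	saved := []byte(`{"http://example.com/": ["http://example.com/a", "http://example.com/logo.png"]}`)

	var links map[string][]string
	if err := json.Unmarshal(saved, &links); err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Print(toDot(links))
}
```

The resulting text can be piped to Graphviz, for example `dot -Tpng`, to draw the site structure.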