erralb/snapcrawl

Ruby, MIT license

Snapcrawl - crawl a website and take screenshots

Snapcrawl is a command-line utility for crawling a website and saving screenshots of its pages.

Features

Crawls a website to any given depth and saves screenshots
Can capture the full length of the page
Can use a specific resolution for screenshots
Skips capturing if the screenshot was already saved recently
Uses local caching to avoid expensive crawl operations if not needed
Reports broken links
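
The "saved recently" check in the list above can be pictured as a comparison between a file's modification time and a maximum age in seconds. The following is a minimal sketch of that idea, not Snapcrawl's actual implementation: the helper name `fresh?` is hypothetical, and the default of 86400 seconds is borrowed from the documented default of the --age option.

```ruby
# Hypothetical helper (not part of Snapcrawl's API): returns true when a
# screenshot file exists and was modified less than max_age seconds ago.
def fresh?(path, max_age = 86_400)
  File.exist?(path) && (Time.now - File.mtime(path)) < max_age
end
```

A crawler built this way would call `fresh?` before each capture and skip the page when it returns true, so repeated runs only re-capture stale or missing screenshots.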

Prerequisites

Snapcrawl requires PhantomJS and ImageMagick.

Docker Image

You can run Snapcrawl using this Docker image, which contains all the necessary prerequisites:

$ docker pull dannyben/snapcrawl

Then you can use it like this:

$ docker run --rm -it dannyben/snapcrawl --help

For more information, refer to the docker-snapcrawl repository.

Install

$ gem install snapcrawl

Usage

$ snapcrawl --help

Snapcrawl

Usage:
  snapcrawl go URL [options]
  snapcrawl -h | --help 
  snapcrawl -v | --version

Options:
  -f, --folder PATH
    Where to save screenshots [default: snaps]

  -n, --name TEMPLATE
    Filename template. Include the string '%{url}' anywhere in the name to 
    use the captured URL in the filename [default: %{url}]

  -a, --age SECONDS
    Number of seconds to consider screenshots fresh [default: 86400]

  -d, --depth LEVELS
    Number of levels to crawl [default: 1]

  -W, --width PIXELS
    Screen width in pixels [default: 1280]

  -H, --height PIXELS
    Screen height in pixels. Use 0 to capture the full page [default: 0]

  -s, --selector SELECTOR
    CSS selector to capture

  -o, --only REGEX
    Include only URLs that match REGEX

  -h, --help
    Show this screen

  -v, --version
    Show version number

Examples:
  snapcrawl go example.com
  snapcrawl go example.com -d2 -fscreens
  snapcrawl go example.com -d2 > out.txt 2> err.txt &
  snapcrawl go example.com -W360 -H480
  snapcrawl go example.com --selector "#main-content"
  snapcrawl go example.com --only "products|collections"
  snapcrawl go example.com --name "screenshot-%{url}"
  snapcrawl go example.com --name "`date +%Y%m%d`_%{url}"
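
The '%{url}' placeholder in --name follows Ruby's named format-string syntax, and --only takes a regular expression. The sketch below illustrates how such options could behave; it is an assumption-laden illustration rather than Snapcrawl's own code. The helper names `screenshot_name` and `filter_urls` are hypothetical, and the rule used to make a URL filename-safe is an assumption, since the help text does not say how Snapcrawl converts URLs to filenames.

```ruby
# Hypothetical helper: expands a --name style template for a single URL.
def screenshot_name(template, url)
  # Assumed sanitizing rule: collapse every run of characters other than
  # letters, digits, dots and dashes into a single dash.
  safe = url.gsub(/[^a-zA-Z0-9.\-]+/, "-")
  # String#% with a hash fills in named references such as %{url}.
  template % { url: safe }
end

# Hypothetical helper: keeps only the URLs that match an --only style
# pattern; returns all URLs when no pattern is given.
def filter_urls(urls, only = nil)
  return urls if only.nil?

  pattern = Regexp.new(only)
  urls.select { |url| url.match?(pattern) }
end
```

Under these assumptions, `screenshot_name("screenshot-%{url}", "http://example.com/products/a")` yields a name that starts with "screenshot-", and `filter_urls(urls, "products|collections")` drops any URL that contains neither word.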