Primary language: Ruby. License: MIT.

Snapcrawl - crawl a website and take screenshots


Snapcrawl is a command-line utility for crawling a website and saving screenshots of its pages.

Features

  • Crawls a website to any given depth and saves screenshots
  • Can capture the full length of the page
  • Can use a specific resolution for screenshots
  • Skips capturing if the screenshot was already saved recently
  • Uses local caching to avoid expensive crawl operations if not needed
  • Reports broken links
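
The "saved recently" check can be pictured as a comparison between a file's modification time and the --age value. The following is only a minimal sketch of that idea, not Snapcrawl's actual code; the helper name fresh? and the sample filename are hypothetical.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper (not part of Snapcrawl's API): a screenshot counts as
# fresh when the file exists and was modified less than max_age seconds ago.
def fresh?(path, max_age, now: Time.now)
  File.exist?(path) && (now - File.mtime(path)) < max_age
end

# Small demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  shot = File.join(dir, "example-com.png") # assumed filename, for illustration
  puts "before capture: fresh? #{fresh?(shot, 86_400)}"
  FileUtils.touch(shot)                    # stand-in for saving a screenshot
  puts "after capture:  fresh? #{fresh?(shot, 86_400)}"
  puts "25 hours later: fresh? #{fresh?(shot, 86_400, now: Time.now + 90_000)}"
end
```

With the default of 86400 seconds, a page captured within the last day would be skipped on the next run, while an older screenshot would be taken again.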

Prerequisites

Snapcrawl requires PhantomJS and ImageMagick.
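
Before installing the gem, it may help to confirm that both tools are reachable from your shell. The script below is a rough sketch that searches PATH for them. The executable names are assumptions (PhantomJS usually installs phantomjs, and ImageMagick usually provides convert or magick depending on the version), and the which helper is written here for illustration only.

```ruby
# Return the full path of an executable found on PATH, or nil if none is found.
def which(cmd)
  # PATHEXT is only set on Windows; elsewhere we try the bare name.
  exts = ENV["PATHEXT"] ? ENV["PATHEXT"].split(";") : [""]
  ENV["PATH"].to_s.split(File::PATH_SEPARATOR).each do |dir|
    exts.each do |ext|
      candidate = File.join(dir, "#{cmd}#{ext}")
      return candidate if File.executable?(candidate) && !File.directory?(candidate)
    end
  end
  nil
end

# Assumed executable names; adjust them to match your installation.
missing = []
missing << "PhantomJS" unless which("phantomjs")
missing << "ImageMagick" unless which("convert") || which("magick")

puts missing.empty? ? "All prerequisites found" : "Missing: #{missing.join(', ')}"
```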

Docker Image

You can run Snapcrawl using this Docker image, which contains all the necessary prerequisites:

$ docker pull dannyben/snapcrawl

Then you can use it like this:

$ docker run --rm -it dannyben/snapcrawl --help

For more information refer to the docker-snapcrawl repository.

Install

$ gem install snapcrawl

Usage

$ snapcrawl --help

Snapcrawl

Usage:
  snapcrawl go URL [options]
  snapcrawl -h | --help 
  snapcrawl -v | --version

Options:
  -f, --folder PATH
    Where to save screenshots [default: snaps]

  -n, --name TEMPLATE
    Filename template. Include the string '%{url}' anywhere in the name to 
    use the captured URL in the filename [default: %{url}]

  -a, --age SECONDS
    Number of seconds to consider screenshots fresh [default: 86400]

  -d, --depth LEVELS
    Number of levels to crawl [default: 1]

  -W, --width PIXELS
    Screen width in pixels [default: 1280]

  -H, --height PIXELS
    Screen height in pixels. Use 0 to capture the full page [default: 0]

  -s, --selector SELECTOR
    CSS selector to capture

  -o, --only REGEX
    Include only URLs that match REGEX

  -h, --help
    Show this screen

  -v, --version
    Show version number

Examples:
  snapcrawl go example.com
  snapcrawl go example.com -d2 -fscreens
  snapcrawl go example.com -d2 > out.txt 2> err.txt &
  snapcrawl go example.com -W360 -H480
  snapcrawl go example.com --selector "#main-content"
  snapcrawl go example.com --only "products|collections"
  snapcrawl go example.com --name "screenshot-%{url}"
  snapcrawl go example.com --name "`date +%Y%m%d`_%{url}"
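
The --name and --only options can be illustrated with a short Ruby sketch. The '%{url}' placeholder has the same form as a named reference in a Ruby format string, and --only behaves like a regular expression filter over the discovered URLs. Everything below is an assumption made for illustration: the helpers slug, filename_for and select_urls are hypothetical, and this README does not state how Snapcrawl sanitizes URLs, which file extension it uses, or which part of the URL the regex is matched against.

```ruby
# Hypothetical slug helper: drops the scheme and replaces every run of
# non-alphanumeric characters with a dash. Snapcrawl's real rules may differ.
def slug(url)
  url.sub(%r{\Ahttps?://}, "").gsub(/[^A-Za-z0-9]+/, "-").gsub(/\A-|-\z/, "")
end

# Expand a --name style template. The ".png" extension is an assumption.
def filename_for(url, template = "%{url}")
  format(template, url: slug(url)) + ".png"
end

# Keep only the URLs matching an --only style regex (nil means keep all).
# Matching against the full URL is an assumption.
def select_urls(urls, only = nil)
  return urls if only.nil?

  pattern = Regexp.new(only)
  urls.select { |u| u.match?(pattern) }
end

urls = [
  "http://example.com/",
  "http://example.com/products/shoes",
  "http://example.com/collections",
  "http://example.com/about"
]

select_urls(urls, "products|collections").each do |url|
  puts filename_for(url, "screenshot-%{url}")
end
```

In this sketch only the products and collections pages pass the filter, and each one is given a name such as screenshot-example-com-products-shoes.png. In the last example above, the backticks around date are expanded by the shell before Snapcrawl receives the template, so only '%{url}' is left for Snapcrawl to substitute.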