crab

A versatile tool to crawl dozens of URLs from a given source, like a sitemap or a URL list.

Useful for:

  • Warming site caches
  • Checking response times
  • Identifying dead or broken pages

Install

Binaries

You can download the latest release for Linux (including .deb and .rpm packages), macOS and Windows from the project's releases page.

Build Command (Personal & Development Usage)

# Linux (Debian/Ubuntu) & macOS, run from the repository root
$ go build -o crab cmd/crab/main.go

# Windows, run from the repository root
$ go build -o crab.exe cmd/crab/main.go

Docker

Docker Hub

$ docker pull atomicptr/crab
# Example
$ docker run --rm atomicptr/crab --help
$ docker run --rm atomicptr/crab crawl:sitemap https://domain.com/sitemap.xml

Arch Linux

AUR

Nix

crab is not available in nixpkgs, but you can use my own Nix repository:

let
  atomicptr = import (fetchTarball "https://github.com/atomicptr/nix/archive/refs/heads/master.tar.gz") {};
in
{
  environment.systemPackages = with pkgs; [
    atomicptr.crab
  ];
}

Homebrew

$ brew install atomicptr/tools/crab

Scoop

$ scoop bucket add atomicptr https://github.com/atomicptr/scoop-bucket
$ scoop install crab

Usage

Crawl individual URLs:

$ crab crawl https://domain.com https://domain.com/test
{"status": 200, "url": "https://domain.com", ...}
...

Crawl through a sitemap:

$ crab crawl:sitemap https://domain.com/sitemap.xml

Replace the base URL of every crawled URL with a different one (for example, to crawl a staging site using the production sitemap):

$ crab crawl:sitemap https://domain.com/sitemap.xml --prefix-url=https://staging.domain.com

Add some cookies/headers:

$ crab crawl:sitemap https://domain.com/sitemap.xml --cookie auth_token=12345 --header X-Bypass-Cache=1

Filter by Status Code

You can filter the output by its status code:

# Only return responses with status 200 (OK)
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=200
# Only return responses whose status is not 200 (quoted so that the shell does not interpret "!")
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status='!200'
# Only return responses in the range 500-599
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=500-599
# Only return responses with status 200 or 404 (multiple conditions are combined with OR: a response matching any one of them is returned)
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status=200,404
# Only return responses with a status code greater than 500 (quoted so that the shell does not treat ">" as a redirect)
$ crab crawl:sitemap https://domain.com/sitemap.xml --filter-status='>500'

Save URL List to a File

You can save the URL list to a file:

# This will save the output to ./output/output.txt
$ crab crawl:sitemap https://domain.com/sitemap.xml --output-file ./output/output.txt

Save Output to JSON

You can save the output to a JSON file:

# This will save the output to ./output/output.json
$ crab crawl:sitemap https://domain.com/sitemap.xml --output-json ./output/output.json

License

MIT