Configurable Scraper & Downloader, Powered by RegExp and Go

GRAB

Greedy, Regex-Aware Binary Downloader

Table of contents

  • Why
  • Installation
  • Usage
  • Quickstart
  • Command Options
  • Next steps
  • Credits
  • License

Why

This project helps you automate scraping data and downloading assets from the internet. It is built on Go's regular expression engine and HCL, for ease of use, performance, and flexibility.

Installation

Download and install the latest release.

Usage

Run the following command to generate a new configuration file in the current directory.

grab config generate

Note
Grab's configuration file uses HashiCorp's HCL.
You can always refer to their specification for topics not covered by the documentation in this repo.

Once you're happy with your configuration, you can check that everything is in order by running:

grab config check

To scrape and download assets, pass one or more URLs to the get subcommand:

# single URL
grab get https://url.to/scrape/files?from

# list of URLs
grab get urls.ini

# at least one of each
grab get https://my.url/and urls.ini list.ini

Note
The list of URLs can contain comments, as in the INI format: all lines starting with # or ; are ignored.
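For example, a list file (hypothetically named urls.ini, to match the command above, and assuming one URL per line) could look like this:

; pictures to grab
https://unsplash.com/photos/uOi3lg8fGl4
https://example.com/gallery/1
# this entry is ignored
; https://example.com/gallery/2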

Quickstart

The default configuration, generated with grab config generate, already works out of the box.

global {
  location = "/home/yourusername/Downloads/grab"
}

site "unsplash" {
  test = "unsplash"

  asset "image" {
    pattern = "contentUrl\":\"([^\"]+)\""
    capture = 1

    transform filename {
      pattern = "(?:.+)photos\\/(.*)"
      replace = "$${1}.jpg"
    }
  }

  info "title" {
    pattern = "meta[^>]+property=\"og:title\"[^>]+content=\"(?P<title>[^\"]+)\""
    capture = "title"
  }

  subdirectory {
    pattern = "\\(@(?P<username>\\w+)\\)"
    capture = "username"
    from    = body
  }
}

For demonstration purposes, we can already download pictures from Unsplash with the following command:

grab get https://unsplash.com/photos/uOi3lg8fGl4

Warning
Please use this tool responsibly. Don't use it for denial-of-service attacks, and don't violate copyright or intellectual property rights.

Internally, the program checks each URL passed to get: if a URL matches the test pattern of any site block, the page is parsed to find all matches for the assets and data defined in that site's asset and info blocks. Once all the asset URLs are gathered, the download starts.

After running the above command, you should have a new grab directory in your ~/Downloads folder, containing a subdirectory for each site defined in the configuration. Inside each site directory you will find all the assets extracted from the provided URLs.
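With the default configuration above, the result should look roughly like the tree below; the photographer's username comes from the subdirectory block and the file name from the transform filename block (both shown here as placeholders):

/home/yourusername/Downloads/grab
└── unsplash
    └── <photographer-username>
        └── <photo-id>.jpg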

The configuration syntax is based on a few fundamental blocks:

  • global block defines the main download directory and global network options.
  • site <name> blocks group other blocks based on the site URL.
  • asset <name> blocks define what to look for from each site and how to download it.
  • info <name> blocks define what strings to extract from the page body.

Additional configuration settings can be specified (a combined skeleton is sketched after this list):

  • network blocks to pass headers and other network options when making requests.
  • transform url blocks to replace the asset URL before downloading.
  • transform filename blocks to replace the asset's destination path.
  • subdirectory blocks to organize downloads into subdirectories named by strings present in the page body or URL.
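Putting these together, a configuration can be laid out roughly as sketched below. The site name and patterns are placeholders, and the exact nesting of the network and transform url blocks is an assumption based on the descriptions above; refer to the guide for the authoritative structure.

global {
  location = "/home/yourusername/Downloads/grab"   # main download directory

  network {
    # headers and other network options applied to requests
  }
}

site "example" {
  test = "example"                  # URLs matching this regex use this site block

  asset "file" {
    pattern = "..."                 # regex that finds asset URLs in the page body
    capture = 1

    transform url {
      # rewrite the matched asset URL before downloading
    }

    transform filename {
      # rewrite the asset's destination path
    }
  }

  info "title" {
    pattern = "..."                 # regex that extracts a string from the page body
    capture = "title"
  }

  subdirectory {
    # place downloads in a subdirectory named after a string
    # captured from the page body or URL
  }
}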

For a more in-depth look at Grab's configuration options, check out the guide.

Command Options

To get help about any command, use the help subcommand or the --help flag:

# to list all available commands:
grab help

# to show instructions for a specific subcommand:
grab help <subcommand>

get

Arguments

Accepts both URLs and paths to lists of URLs; both can be provided at the same time.

# grab get <url|file> [url|file...] [options]

grab get https://example.com/gallery/1 \
         https://example.com/gallery/2 \
         path/to/list.ini \
         other/file.ini -n

Options

Long      Short  Default  Description
force     f      false    To overwrite already existing files
config    c      nil      To specify the path to a configuration file
strict    s      false    To stop the program at the first encountered error
dry-run   n      false    To send requests without writing to the disk
progress  p      false    To show a progress bar
quiet     q      false    To suppress all output to stdout (errors will still be printed to stderr); takes precedence over verbose
verbose   v      1        To set the verbosity level: -v is 1, -vv is 2 and so on; quiet overrides this option
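
Options can be mixed with URLs and list files in a single invocation. Assuming the long forms take the usual double-dash prefix (as --help does above), a dry run with a progress bar would look like this:

# resolve everything but write nothing to disk, showing progress
grab get urls.ini --dry-run --progress

# the same invocation using the short flags
grab get urls.ini -n -p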

Next steps

  • Retries & Timeout
  • Network options with inheritance
  • URL manipulation
  • Destination manipulation
  • Improve logging
  • Check for updates
  • Display a progress bar
  • Add HCL eval context functions
  • Distribute via various package managers:
    • Homebrew
    • Apt
    • Chocolatey
    • Scoop
  • Scripting language integration
  • Plugin system
  • Sequential jobs (like GitHub workflows)

Credits

License

Distributed under the MIT License.