goscrape

A web scraper built with Golang. It downloads the content of a website or blog and allows you to read it offline.

Features and advantages over existing tools like wget, HTTrack, and Teleport Pro:

  • Free and open source
  • Available for all platforms that Golang supports
  • JPEG and PNG images can be re-encoded at a lower quality to save disk space (see the sketch after this list)
  • Excluded URLs will not be fetched (unlike wget)
  • No incomplete temp files are left on disk
  • Asset files that were already downloaded are skipped on subsequent scraper runs
  • Assets from external domains are downloaded automatically
  • Sane default values
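
The image re-encoding can be pictured with the following minimal Go sketch, which decodes a JPEG and writes it back at a lower quality using only the standard library. The function name and quality value are illustrative, not goscrape's actual implementation:

  package main

  import (
      "bytes"
      "image/jpeg"
      "os"
  )

  // recompressJPEG decodes JPEG data and re-encodes it at the given
  // quality (1-100). A lower quality produces a smaller file.
  // Illustrative sketch only - not goscrape's actual code.
  func recompressJPEG(data []byte, quality int) ([]byte, error) {
      img, err := jpeg.Decode(bytes.NewReader(data))
      if err != nil {
          return nil, err
      }
      var buf bytes.Buffer
      if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: quality}); err != nil {
          return nil, err
      }
      return buf.Bytes(), nil
  }

  func main() {
      data, err := os.ReadFile("image.jpg")
      if err != nil {
          panic(err)
      }
      smaller, err := recompressJPEG(data, 70)
      if err != nil {
          panic(err)
      }
      if err := os.WriteFile("image_small.jpg", smaller, 0o644); err != nil {
          panic(err)
      }
  }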

Limitations:

  • No GUI version, console only

Install:

You need to have Golang installed; if you do not, follow the installation guide at https://golang.org/doc/install.

go install github.com/cornelk/goscrape@latest
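
The binary ends up in $GOBIN (by default $HOME/go/bin), so make sure that directory is in your PATH. Assuming the command-line parser provides the usual built-in help flag, you can then verify the installation with:

goscrape --help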

Usage:

goscrape http://website.com

Options:

Scrape a website and create an offline browsable version of it on disk

Usage:
  goscrape http://website.com [flags]

Flags:
      --config string         config file (default is $HOME/.goscrape.yaml)
  -d, --depth uint            download depth, 0 for unlimited (default 10)
  -x, --exclude stringArray   exclude URLs matching Perl-style regular expressions; https://regex101.com/ can help with building them
  -i, --imagequality uint     image quality, 0 to disable reencoding
  -o, --output string         output directory to write files to
  -v, --verbose               verbose output
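
For example, the following invocation (the URL and paths are placeholders) scrapes a site two levels deep, re-encodes images at quality 70, skips all URLs containing /tags/, and writes the files to the directory website-offline:

goscrape --depth 2 --imagequality 70 --exclude "/tags/" --output website-offline http://website.com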