A web scraper built with Golang. It downloads the content of a website or blog and allows you to read it offline.
Features and advantages over existing tools such as wget, httrack and Teleport Pro:
- Free and open source
- Available for all platforms that Golang supports
- JPEG and PNG images can be re-encoded at a lower quality to save disk space
- Excluded URLs will not be fetched (unlike wget)
- No incomplete temp files are left on disk
- Asset files that were already downloaded are skipped in subsequent scraper runs
- Assets from external domains are downloaded automatically
- Sane default values
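The image quality feature in the list above can be illustrated with a short sketch. It is not taken from the goscrape source: the function names `reencodeJPEG` and `sampleJPEG` are hypothetical, only the JPEG case is covered, and the rule of keeping the original when re-encoding saves nothing is an assumption. It uses only Go's standard `image/jpeg` package.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	"image/jpeg"
)

// reencodeJPEG decodes a JPEG and encodes it again at the given quality (1-100).
// A quality of 0 returns the input unchanged, mirroring the "0 to disable"
// wording of the --imagequality flag. (Hypothetical helper, not goscrape's API.)
func reencodeJPEG(src []byte, quality int) ([]byte, error) {
	if quality == 0 {
		return src, nil
	}
	img, err := jpeg.Decode(bytes.NewReader(src))
	if err != nil {
		return nil, fmt.Errorf("decoding image: %w", err)
	}
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: quality}); err != nil {
		return nil, fmt.Errorf("encoding image: %w", err)
	}
	// Assumption: keep the original bytes if re-encoding did not save space.
	if buf.Len() >= len(src) {
		return src, nil
	}
	return buf.Bytes(), nil
}

// sampleJPEG builds a small patterned test image and encodes it as JPEG.
func sampleJPEG(quality int) ([]byte, error) {
	img := image.NewRGBA(image.Rect(0, 0, 64, 64))
	for y := 0; y < 64; y++ {
		for x := 0; x < 64; x++ {
			img.Set(x, y, color.RGBA{R: uint8(x * 4), G: uint8(y * 4), B: uint8((x * y) % 256), A: 255})
		}
	}
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: quality}); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	src, err := sampleJPEG(100)
	if err != nil {
		panic(err)
	}
	out, err := reencodeJPEG(src, 30)
	if err != nil {
		panic(err)
	}
	fmt.Printf("original: %d bytes, re-encoded: %d bytes\n", len(src), len(out))
}
```

Running the program prints the byte counts before and after re-encoding. A PNG variant would follow the same decode and encode pattern with the `image/png` package, although PNG has no lossy quality setting in the standard library.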
Limitations:
- No GUI version, console only
Installation:
Golang needs to be installed; if it is not, follow the guide at https://golang.org/doc/install. Then install the scraper:
go get github.com/cornelk/goscrape
Basic usage:
goscrape http://website.com
Options:
Scrape a website and create an offline browsable version on the disk
Usage:
goscrape http://website.com [flags]
Flags:
      --config string         config file (default is $HOME/.goscrape.yaml)
  -d, --depth uint            download depth, 0 for unlimited (default 10)
  -x, --exclude stringArray   exclude URLs with PERL Regular Expressions support
  -h, --help                  help for goscrape
  -i, --imagequality int      image quality, 0 to disable reencoding
  -n, --include stringArray   only include URLs with PERL Regular Expressions support
  -o, --output string         output directory to write files to
  -u, --user string           user[:password] to use for authentication
  -v, --verbose               verbose output
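To make the interaction of the --include and --exclude flags concrete, here is a minimal sketch of a URL filter. It rests on assumptions: the type `urlFilter` and its functions are hypothetical names, the precedence (exclude wins, and with include patterns set a URL must match at least one) is a guess at sensible behavior rather than documented goscrape behavior, and it uses Go's standard `regexp` package, which implements RE2 syntax and lacks some Perl features such as backreferences.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlFilter holds compiled include and exclude patterns.
// (Hypothetical type for illustration, not part of goscrape.)
type urlFilter struct {
	includes []*regexp.Regexp
	excludes []*regexp.Regexp
}

// compileAll compiles every pattern and fails on the first invalid one.
func compileAll(patterns []string) ([]*regexp.Regexp, error) {
	var out []*regexp.Regexp
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("compiling pattern %q: %w", p, err)
		}
		out = append(out, re)
	}
	return out, nil
}

// newURLFilter builds a filter from raw pattern strings.
func newURLFilter(includes, excludes []string) (*urlFilter, error) {
	inc, err := compileAll(includes)
	if err != nil {
		return nil, err
	}
	exc, err := compileAll(excludes)
	if err != nil {
		return nil, err
	}
	return &urlFilter{includes: inc, excludes: exc}, nil
}

// allowed reports whether a URL should be fetched.
// Assumed precedence: any exclude match rejects the URL; if include
// patterns exist, the URL must match at least one of them.
func (f *urlFilter) allowed(url string) bool {
	for _, re := range f.excludes {
		if re.MatchString(url) {
			return false
		}
	}
	if len(f.includes) == 0 {
		return true
	}
	for _, re := range f.includes {
		if re.MatchString(url) {
			return true
		}
	}
	return false
}

func main() {
	f, err := newURLFilter([]string{`^https?://website\.com/blog/`}, []string{`\.pdf$`})
	if err != nil {
		panic(err)
	}
	urls := []string{
		"http://website.com/blog/post-1",
		"http://website.com/blog/file.pdf",
		"http://website.com/about",
	}
	for _, u := range urls {
		fmt.Printf("%-40s fetch=%v\n", u, f.allowed(u))
	}
}
```

Checking the filter before issuing a request is what keeps excluded URLs from being fetched at all, as opposed to downloading them first and discarding them afterwards.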
Dependencies:
- github.com/gorilla/css: CSS file tokenizer
- github.com/hashicorp/go-multierror: multi error wrapping
- github.com/headzoo/surf: virtual web browser
- github.com/PuerkitoBio/goquery: HTML document traversal
- github.com/spf13/cobra: command line handling
- github.com/spf13/viper: configuration
- gopkg.in/h2non/filetype.v1: image format identification
- go.uber.org/zap: logging