/obelisk

Go package and CLI tool for saving web page as single HTML file

Primary LanguageGoMIT LicenseMIT

Obelisk

Go packages and CLI tool for saving web page as single HTML file


Obelisk is a Go package and CLI tool for saving web page as single HTML file, with all of its assets embedded. It's inspired by the great Monolith and intended as improvement for my old WARC package.

Features

  • Embeds all resources (e.g. CSS, image, JavaScript, etc) producing a single HTML5 document that is easy to store and share.
  • In case the submitted URL is not HTML (for example a PDF page), Obelisk will still save it as it is.
  • Downloading each assets are done concurrently, which make the archival process for a web page is quite fast.
  • Accepts cookies, useful for pages that need login or article behind paywall.

As Go package

Run following command inside your Go project :

go get -u -v github.com/go-shiori/obelisk

Next, include Obelisk in your application :

import "github.com/go-shiori/obelisk"

Now you can use Obelisk archival feature for your application. For basic usage you can check the example.

As CLI application

You can download the latest version of Obelisk from release page. To build from source, make sure you use go >= 1.13 then run following commands :

go get -u -v github.com/go-shiori/obelisk/cmd/obelisk

Now you can use it from your terminal :

$ obelisk -h

CLI tool for saving web page as single HTML file

Usage:
  obelisk [url1] [url2] ... [urlN] [flags]

Flags:
  -z, --gzip                          gzip archival result
  -h, --help                          help for obelisk
  -i, --input string                  path to file which contains URLs
      --insecure                      skip X.509 (TLS) certificate verification
  -c, --load-cookies string           path to Netscape cookie file
      --max-concurrent-download int   max concurrent download at a time (default 10)
      --no-css                        disable CSS styling
      --no-embeds                     remove embedded elements (e.g iframe)
      --no-js                         disable JavaScript
      --no-medias                     remove media elements (e.g img, audio)
  -o, --output string                 path to save archival result
  -q, --quiet                         disable logging
      --skip-resource-url-error       skip process resource url error
  -t, --timeout int                   maximum time (in second) before request timeout (default 60)
  -u, --user-agent string             set custom user agent
      --verbose                       more verbose logging

There are some CLI behavior that I think need to be explained more here :

  • The --input flag accepts text file that contains list of urls that look like this :

    http://www.domain1.com/some/path
    http://www.domain2.com/some/path
    http://www.domain3.com/some/path
    
  • The --load-cookies flag accepts Netscape cookie file that usually look like this :

    # Netscape HTTP Cookie File
    # https://curl.haxx.se/rfc/cookie_spec.html
    # This is a generated file! Do not edit.
    
    #HttpOnly_.google.com	TRUE	/	FALSE	1631153524	KEY	VALUE
    #HttpOnly_.google.com	TRUE	/ads	TRUE	1621062000	KEY	VALUE
    .developers.google.com	TRUE	/	FALSE	1642167486	KEY	VALUE
    
  • If --output flag is not specified then Obelisk will generate file name for the archive and save it in current working directory.

  • If --output flag is set to - and there is only one URL to process (either from input file or from CLI arguments) then the default output will be stdout.

  • If --output flag is specified but there are more than one URL to process, Obelisk will generate file name for the archive, but keep using the directory from the specified output path.

  • If --output flag is specified but it sets to an existing directory, Obelisk will also generate file name for the archive.

F.A.Q

Why the name is Obelisk ?

It's inspired by Monolith, therefore it's Obelisk.

How does it compare to WARC ?

My WARC package uses bolt database to contain archival result, which make it hard to share and view. I also think my code in WARC is not really easy to understand, so I often confused when I try to add additional feature or refactoring it.

How does it compare to Monolith ?

  • Both embeds all resources to HTML file, mostly using base64 data URL. The difference is Obelisk will use inline <script> and <style> for external JavaScript and CSS files. This is done because in many page the browser will struggles to load JavaScript that encoded into data URL. Inlining scripts and styles also make archival result smaller since we don't encode them using base64.
  • In Obelisk all request to external URL is disabled by default using Content Security Policy, while in Monolith we need to specify it manually. This is done because in my opinion archive shouldn't need and shouldn't be able to send request to external resources.
  • In Obelisk downloading assets are done concurrently. Thanks to this, Obelisk (most of the time) will be faster than Monolith when archiving a web page.

Why not just contribute to Monolith ?

  • I don't have any knowledge about Rust. I do want to learn it though.
  • I have a plan to update Shiori, so I need a Go package for archiving web page.

Attributions

Original logo is created by Freepik in theirs egypt and desert pack, which can be downloaded from www.flaticon.com.

License

Obelisk is distributed using MIT license, which means you can use and modify it however you want. However, if you make an enhancement for it, if possible, please send a pull request. If you like this project, please consider donating to me either via PayPal or Ko-Fi.