
Own webarchive service

Primary language: Go. License: BSD-3-Clause (BSD 3-Clause "New" or "Revised" License).

Own Webarchive

Aims to be a simple, fast, and easy-to-use web archive for personal or home-network use.

Supported store formats

  • headers — save all headers from response
  • pdf — save page in pdf
  • single_file — save html and all its resources (css,js,images) into one html file

Requirements

  • Go 1.19 or higher
  • wkhtmltopdf binary in $PATH (to save pages in pdf)

Configuration

The service is configured via environment variables. The available variables are:

  • DB
    • DB_PATH — path for the database files (default ./db)
  • LOGGING
    • LOGGING_DEBUG — enable debug logs (default false)
  • API
    • API_ADDRESS — address the API server will listen on (default 0.0.0.0:5001)
  • UI
    • UI_ENABLED — Enable builtin web UI (default true)
    • UI_PREFIX — Prefix for the web UI (default /)
    • UI_THEME — UI theme name (default basic). No other values available yet
  • PDF
    • PDF_LANDSCAPE — use landscape page orientation instead of portrait (default false)
    • PDF_GRAYSCALE — use grayscale filter for the output pdf (default false)
    • PDF_MEDIA_PRINT — use media type print for the request (default true)
    • PDF_ZOOM — zoom page (default 1.0 i.e. no actual zoom)
    • PDF_VIEWPORT — use specified viewport value (default 1280x720)
    • PDF_DPI — use specified DPI value for the output pdf (default 150)
    • PDF_FILENAME — use specified name for output pdf file (default page.pdf)

Note: To avoid conflicts with other environment variables, any variable name can be prefixed with WEBARCHIVE_ (for example, WEBARCHIVE_API_ADDRESS).

Usage

1. Start the server

Start without docker

go run ./cmd/server/main.go

Change API address

API_ADDRESS=127.0.0.1:3001 go run ./cmd/server/main.go

Start in docker

docker compose up -d webarchive

2. Add a page

curl -X POST --location "http://localhost:5001/api/v1/pages" \
    -H "Content-Type: application/json" \
    -d "{
          \"url\": \"https://github.com/wkhtmltopdf/wkhtmltopdf/issues/1937\",
          \"formats\": [
            \"pdf\",
            \"headers\"
          ]
        }" | jq .

or

curl -X POST --location \
  "http://localhost:5001/api/v1/pages?url=https%3A%2F%2Fgithub.com%2Fwkhtmltopdf%2Fwkhtmltopdf%2Fissues%2F1937&formats=pdf%2Cheaders&description=Foo+Bar"
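
For callers that prefer Go over curl, both request variants above could be built roughly as follows. The endpoint path, field names, and query parameters come from the curl examples; the type and function names (addPageRequest, newJSONRequest, queryURL) are hypothetical. The sketch only constructs the requests and does not send them.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// addPageRequest mirrors the JSON body used in the first curl example.
type addPageRequest struct {
	URL     string   `json:"url"`
	Formats []string `json:"formats"`
}

// newJSONRequest builds the POST request with a JSON body.
func newJSONRequest(base, pageURL string, formats []string) (*http.Request, error) {
	body, err := json.Marshal(addPageRequest{URL: pageURL, Formats: formats})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, base+"/api/v1/pages", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// queryURL builds the query-string variant from the second curl example.
// url.Values.Encode sorts keys, so the parameter order differs from the
// curl command while the encoded values are the same.
func queryURL(base, pageURL string, formats []string, description string) string {
	q := url.Values{}
	q.Set("url", pageURL)
	q.Set("formats", strings.Join(formats, ","))
	if description != "" {
		q.Set("description", description)
	}
	return base + "/api/v1/pages?" + q.Encode()
}

func main() {
	const base = "http://localhost:5001"
	const page = "https://github.com/wkhtmltopdf/wkhtmltopdf/issues/1937"

	req, err := newJSONRequest(base, page, []string{"pdf", "headers"})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
	fmt.Println(queryURL(base, page, []string{"pdf", "headers"}, "Foo Bar"))
}
```

To actually submit the page, pass the request to http.DefaultClient.Do, or POST to the URL returned by queryURL.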

3. Get the page's info

curl -X GET --location "http://localhost:5001/api/v1/pages/$page_id" | jq .

Here $page_id is the value of the id field from the previous command's response. Once the status field in the response is success (or with_errors), the results field contains all processed formats along with the IDs of the stored files.

4. Open file in browser

xdg-open "http://localhost:5001/api/v1/pages/$page_id/file/$file_id"

Here $page_id is the value of the id field from the page's info response, and $file_id is the ID of the file you want to open.
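
The file URL can also be assembled programmatically. A minimal sketch, assuming the IDs are plain strings; the function name fileURL is hypothetical, and the path layout is copied from the xdg-open command above.

```go
package main

import (
	"fmt"
	"net/url"
)

// fileURL joins the base address, page ID, and file ID into the file
// endpoint path. PathEscape guards against IDs containing characters
// that are not safe in a URL path segment.
func fileURL(base, pageID, fileID string) string {
	return base + "/api/v1/pages/" + url.PathEscape(pageID) + "/file/" + url.PathEscape(fileID)
}

func main() {
	// Placeholder IDs; real values come from the page's info response.
	fmt.Println(fileURL("http://localhost:5001", "PAGE_ID", "FILE_ID"))
}
```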

5. List all stored pages

curl -X GET --location "http://localhost:5001/api/v1/pages" | jq .

Roadmap

  • Save page to pdf
  • Save URL headers
  • Save page to the single-page html
  • Save page to html with separate resource files (?)
  • Basic web UI
  • Optional authentication
  • Multi-user access
  • Support SQL database with or without separate files storage
  • Tags/Categories
  • Save page to markdown