charlesroelli/org-board

Add WARC support

Opened this issue · 5 comments

xvrdm commented

Hi and thanks for the awesome library!

I was wondering if you were aware of this initiative:
https://github.com/gildas-lormeau/SingleFile

It has a CLI, so I guess it could be used as a backend for org-board.

Hi there, thank you for the link! I've not heard of SingleFile but it seems like a good fit for this package. I will look into adding support for it.

c1-g commented

Just throwing this out there: I managed to get org-board working with another program called Monolith that, like SingleFile, saves a webpage as a single HTML file. You can probably adapt this for SingleFile's CLI too.

Basically, I override org-board's org-board-wget-call so that it calls
my own my/org-board-monolith-call instead.

(defun my/org-board-monolith-call (path directory args site)
  "Like `org-board-wget-call' but call monolith instead."
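  ;; Create DIRECTORY along with any missing parents; with the second
  ;; argument non-nil this does not signal an error if it already exists.
  (make-directory (file-name-as-directory directory) t)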
  (let* ((filename (url-filename (url-generic-parse-url (car site))))
         (domain (file-name-nondirectory (url-domain (url-generic-parse-url (car site)))))
         (name (if (string-empty-p filename)
                   domain
                 (if (string-match "/$" filename)
                     (file-name-base (directory-file-name filename))
                   filename)))
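         ;; Full path of the single .html file passed to monolith via -o
         ;; (the variable name is carried over from `org-board-wget-call').
         (output-directory-option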
          (expand-file-name
           (concat (file-name-sans-extension (file-name-nondirectory name)) ".html")
           (file-name-as-directory directory)))
         (output-buffer-name "org-board-monolith-call")
         (process-arg-list (append (list "org-board-monolith-process"
                                         output-buffer-name
                                         path)
                                   org-board-wget-switches
                                   (list "-o")
                                   (list output-directory-option)
                                   args
                                   site))
         (monolith-process (apply 'start-process process-arg-list)))
    (if org-board-wget-show-buffer
        (with-output-to-temp-buffer output-buffer-name
          (set-process-sentinel
           monolith-process
           'org-board-wget-process-sentinel-function))
      (set-process-sentinel
       monolith-process
       'org-board-wget-process-sentinel-function))
    monolith-process))

(advice-add 'org-board-wget-call :override #'my/org-board-monolith-call)

Then I put these in my init.el:

(setq org-board-wget-program (executable-find "monolith"))
(setq org-board-wget-switches '("-IevjF"))

These switches are then passed to monolith instead of wget.

@c1-g That works beautifully! Thanks.

GNU wget has supported creating WARC archives since 2012. See the announcement at https://lists.gnu.org/archive/html/info-gnu/2012-08/msg00002.html

Given that org-board uses wget, can we get WARC support cheaply by using org-board's WGET_OPTIONS property?
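As a rough, untested sketch of what that might look like: wget's --warc-file=FILE option sets the base name of the WARC archive, and org-board passes the contents of the WGET_OPTIONS property on to wget. The heading, the URL, and the base name "example" below are placeholders, and I have not checked where wget writes the WARC file relative to the entry's attachment directory.

```org
* Example page
:PROPERTIES:
:URL:          https://example.org/
:WGET_OPTIONS: --warc-file=example
:END:
```

Archiving this entry with org-board-archive should then produce a WARC file alongside the usual wget download, assuming the installed wget was built with WARC support.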

I've just started using org-board (and org-attachments generally). Combining WARC with WGET_OPTIONS is something I'm keen to try soon.

I'm skeptical about various other archive packages like SingleFile (which has already been forked...). I suppose it depends what you are looking for in a file format:

  • If you just want a single file that can easily be copied or moved (shared as an email attachment, say), then take your pick: SingleFile and WARC both manage that.

  • If you're looking for web browser support, they're all poor choices IMO.

    • I'm unaware of any single-file archive format which is supported by common web browsers. Several browsers have devised their own format (e.g. MAFF) but none have caught on or been adopted by other browsers.
    • Some formats have 3rd-party browser extensions, which could be good for personal use. A downside here is that it doesn't really help when you want to share the archive with somebody else; they'll have to go and find a browser extension too.
  • If your interest is longevity though, then I'd bet on WARC. It's an ISO standard with a detailed spec, and it has the backing of major national libraries and universities. It's been developed and maintained with proper archivists and librarians, who tend to think on a longer time scale than most software developers I've known.