netarchivesuite/solrwayback

Batch content export

tokee opened this issue · 1 comments

tokee commented

At the Royal Danish Library there have been multiple requests for en masse export of raw archive content, e.g. unmodified HTML, images or PDFs. The current exporter only supports WARC for this, and some researchers find WARC files cumbersome to work with.

SolrWayback should have an export option for a more common container format, where 64-bit zip is the obvious candidate, as "all" platforms support it out of the box.
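
To illustrate the container side, here is a minimal sketch in Python (not SolrWayback code) of writing raw payloads to a zip archive with ZIP64 extensions enabled. The function name `export_to_zip` and the `(name, payload)` record shape are assumptions made for the example.

```python
import io
import zipfile


def export_to_zip(records, target):
    """Write (name, payload_bytes) pairs to a zip archive.

    Hypothetical helper: `records` stands in for whatever the exporter
    would resolve from the index; it is not an existing SolrWayback API.
    """
    # allowZip64=True lets the archive grow past the 4 GiB / 65535-entry
    # limits of the classic zip format.
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED,
                         allowZip64=True) as archive:
        for name, payload in records:
            archive.writestr(name, payload)


# Usage: export a single in-memory record to an in-memory archive.
buffer = io.BytesIO()
export_to_zip([("example.html", b"<html></html>")], buffer)
```

Streaming entries into the archive one at a time like this avoids holding the whole export in memory, which matters for large batch exports.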

The big question is how to handle naming for non-WARC export. Two options come to mind:

  1. Best effort à la timestamp/Filename_cleaned_of_non-ASCII_spaces_and_similar.ext
  2. timestamp_hash.ext with a metadata.txt which contains timestamp, hash, WARC-file, WARC-offset, URL
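
The two options could be sketched roughly as follows in Python. All function names, the choice of SHA-1 as the hash, the ASCII whitelist and the tab-separated metadata layout are assumptions for illustration, not decisions made in this issue.

```python
import hashlib
import re
from urllib.parse import unquote, urlsplit


def best_effort_name(timestamp, url):
    """Option 1: timestamp/cleaned_filename.ext (hypothetical helper)."""
    path = unquote(urlsplit(url).path)
    filename = path.rsplit("/", 1)[-1] or "index"
    # Collapse every run of characters outside a conservative ASCII set
    # (non-ASCII letters, spaces and similar) into a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", filename).strip("_") or "index"
    return f"{timestamp}/{cleaned}"


def hashed_name(timestamp, payload, extension):
    """Option 2: timestamp_hash.ext, here with SHA-1 as an assumed hash."""
    digest = hashlib.sha1(payload).hexdigest()
    return f"{timestamp}_{digest}.{extension}"


def metadata_line(timestamp, digest, warc_file, warc_offset, url):
    """One tab-separated metadata.txt row for option 2 (assumed layout)."""
    return "\t".join([timestamp, digest, warc_file, str(warc_offset), url])
```

With option 1, two different URLs captured at the same timestamp can clean to the same name, so a real implementation would need some collision handling. Option 2 avoids that at the cost of less readable file names.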

tokee commented

This looks like a duplicate of #245.