Batch content export
tokee opened this issue · 1 comments
tokee commented
At the Royal Danish Library there have been multiple requests for en masse export of raw archive content, e.g. unmodified HTML, images or PDFs. The current exporter only supports WARC for this, and some researchers find WARC files cumbersome to work with.
SolrWayback should have an export option for a more common container format, where 64-bit zip is the obvious candidate as "all" platforms support it out of the box.
The big question is how to handle naming for non-WARC export. Two options come to mind:
- Best effort ala timestamp/Filename_cleaned_of_non-ASCII_spaces_and_similar.ext
- timestamp_hash.ext with a metadata.txt which contains timestamp, hash, WARC-file, WARC-offset, URL
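To make the two naming options above concrete, here is a rough Python sketch of how they could look. It is not SolrWayback code. The record layout (a dict with timestamp, url, warc_file, warc_offset, payload), the function names, SHA-1 as the hash, the ".bin" fallback extension and the comma-separated metadata.txt layout are all assumptions made for illustration. It also writes metadata.txt for both schemes, which is a simplification.

```python
import csv
import hashlib
import io
import posixpath
import re
import unicodedata
import zipfile
from urllib.parse import unquote, urlsplit


def clean_filename(url):
    """Last URL path segment, reduced to ASCII letters, digits, '.', '_' and '-'."""
    name = unquote(urlsplit(url).path).rsplit("/", 1)[-1]
    # Decompose accented characters, then drop whatever is still non-ASCII.
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Collapse spaces and other unsafe characters into single underscores.
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", name).strip("_.")
    return name or "index"


def sha1_hex(payload):
    # Assumption: SHA-1 of the raw payload; the issue does not name a hash.
    return hashlib.sha1(payload).hexdigest()


def best_effort_name(rec):
    """Option 1: timestamp/cleaned_filename.ext"""
    return f"{rec['timestamp']}/{clean_filename(rec['url'])}"


def hashed_name(rec):
    """Option 2: timestamp_hash.ext, to be resolved through metadata.txt"""
    ext = posixpath.splitext(clean_filename(rec["url"]))[1] or ".bin"
    return f"{rec['timestamp']}_{sha1_hex(rec['payload'])}{ext}"


def export_zip(records, out, namer=hashed_name):
    """Write payloads plus a metadata.txt into a zip (ZIP64 extensions allowed)."""
    meta = io.StringIO()
    rows = csv.writer(meta, lineterminator="\n")
    rows.writerow(["timestamp", "hash", "WARC-file", "WARC-offset", "URL"])
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED, allowZip64=True) as zf:
        for rec in records:
            # Name collisions are not handled in this sketch.
            zf.writestr(namer(rec), rec["payload"])
            rows.writerow([rec["timestamp"], sha1_hex(rec["payload"]),
                           rec["warc_file"], rec["warc_offset"], rec["url"]])
        zf.writestr("metadata.txt", meta.getvalue())


# Hypothetical record, only to exercise the functions above.
record = {
    "timestamp": "20190102030405",
    "url": "http://example.com/docs/Caf%C3%A9%20menu.pdf",
    "warc_file": "example.warc.gz",
    "warc_offset": 1234,
    "payload": b"%PDF-1.4 placeholder",
}
buffer = io.BytesIO()
export_zip([record], buffer)
```

Passing namer=best_effort_name switches to the first option. The trade-off shows up quickly: the best-effort names are readable, but two different captures can end up with the same timestamp and cleaned filename, whereas the hashed names only collide for identical content and need metadata.txt to get back to the URL.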