webrecorder/specs

Validation of Required Contents


We should add validation of the components that must be present in a WACZ, and print out what's missing (if anything) when validation fails:

  • At least one *.warc or *.warc.gz in archive/
  • Either indexes/index.cdx.gz and indexes/index.idx OR indexes/index.cdx (uncompressed version)
  • If any pages/*.jsonl files exist, ensure they are all line-delimited JSON?
    • If any pages/*.jsonl exist, then pages/pages.jsonl must be one of them.
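
A rough sketch of what these checks could look like, using only the Python standard library. This is not the existing validator code: the function names (`find_missing`, `bad_jsonl_lines`, `validate_wacz`) and the message wording are made up for illustration, and it assumes blank lines in a .jsonl file are acceptable and that WARCs may sit anywhere under archive/.

```python
import fnmatch
import json
import zipfile


def find_missing(names):
    """Return a list of messages describing required members that are absent.

    `names` is an iterable of member paths inside the WACZ (e.g. from
    ZipFile.namelist()). Hypothetical helper, not part of any existing tool.
    """
    names = set(names)
    problems = []

    # At least one WARC, plain or gzipped, under archive/.
    # Note: "*" in fnmatch also matches "/", so nested paths are accepted.
    has_warc = any(
        fnmatch.fnmatchcase(n, "archive/*.warc")
        or fnmatch.fnmatchcase(n, "archive/*.warc.gz")
        for n in names
    )
    if not has_warc:
        problems.append("no *.warc or *.warc.gz found in archive/")

    # Either the compressed index pair, or the single uncompressed index.
    compressed = {"indexes/index.cdx.gz", "indexes/index.idx"} <= names
    uncompressed = "indexes/index.cdx" in names
    if not (compressed or uncompressed):
        problems.append(
            "need indexes/index.cdx.gz and indexes/index.idx, "
            "or indexes/index.cdx"
        )

    # If any pages/*.jsonl exist, pages/pages.jsonl must be one of them.
    page_files = [n for n in names if fnmatch.fnmatchcase(n, "pages/*.jsonl")]
    if page_files and "pages/pages.jsonl" not in names:
        problems.append("pages/*.jsonl present but pages/pages.jsonl is missing")

    return problems


def bad_jsonl_lines(text):
    """Return 1-based line numbers that do not parse as JSON.

    Blank lines are skipped (an assumption, not something the issue specifies).
    """
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(lineno)
    return bad


def validate_wacz(path_or_file):
    """Run all checks on a WACZ file and return the list of problems found."""
    with zipfile.ZipFile(path_or_file) as zf:
        names = zf.namelist()
        problems = find_missing(names)
        for name in sorted(names):
            if fnmatch.fnmatchcase(name, "pages/*.jsonl"):
                text = zf.read(name).decode("utf-8", errors="replace")
                for lineno in bad_jsonl_lines(text):
                    problems.append(f"{name}: line {lineno} is not valid JSON")
    return problems
```

Returning a list of messages rather than raising on the first failure makes it easy to print everything that is missing in one pass; an empty list means the archive passed these checks.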

We could even make pages/pages.jsonl always required, even when there are zero pages (just store the header). I guess we could go either way on that.