Validation of Required Contents
Closed this issue · 0 comments
ikreymer commented
We should add additional validation of components that must be in the WACZ, and print out what's missing when failing (if anything):
- At least one `*.warc` or `*.warc.gz` in `archive/`
- Either `indexes/index.cdx.gz` and `indexes/index.idx`, OR `indexes/index.cdx` (uncompressed version)
- If other `pages/*.jsonl` files exist, ensure they are all line-delimited JSON?
  - If any `pages/*.jsonl` exist, then `pages/pages.jsonl` must be one of them.
We could even make `pages/pages.jsonl` always required, even if there are zero pages (just store the header). I guess it could go either way on that.
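
A rough sketch of how these checks might look, using only the Python standard library. The function names (`find_missing`, `invalid_jsonl_lines`, `validate_wacz`) and the `require_pages` flag are made up for illustration and are not part of any existing WACZ tooling; the flag models the "always required" option, which is still undecided.

```python
import json
import zipfile
from fnmatch import fnmatchcase


def find_missing(names, require_pages=False):
    """Return a list of human-readable problems for the given WACZ member paths."""
    names = set(names)
    problems = []

    # At least one WARC (plain or gzipped) under archive/.
    # Note: fnmatch's "*" also matches "/", so nested paths are accepted too.
    if not any(
        fnmatchcase(n, "archive/*.warc") or fnmatchcase(n, "archive/*.warc.gz")
        for n in names
    ):
        problems.append("no *.warc or *.warc.gz in archive/")

    # Either the compressed index pair, or the uncompressed CDX.
    has_compressed = {"indexes/index.cdx.gz", "indexes/index.idx"} <= names
    has_plain = "indexes/index.cdx" in names
    if not (has_compressed or has_plain):
        problems.append(
            "need indexes/index.cdx.gz + indexes/index.idx, or indexes/index.cdx"
        )

    # If any pages/*.jsonl exist, pages/pages.jsonl must be one of them.
    # With require_pages=True (hypothetical option) it is always required.
    has_page_files = any(fnmatchcase(n, "pages/*.jsonl") for n in names)
    if (has_page_files or require_pages) and "pages/pages.jsonl" not in names:
        problems.append("pages/pages.jsonl is missing")

    return problems


def invalid_jsonl_lines(text):
    """Return 1-based numbers of non-empty lines that are not valid JSON."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad


def validate_wacz(path, require_pages=False):
    """Open a WACZ (a ZIP file) and collect every problem found."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        problems = find_missing(names, require_pages)
        for name in sorted(names):
            if fnmatchcase(name, "pages/*.jsonl"):
                bad = invalid_jsonl_lines(zf.read(name).decode("utf-8"))
                if bad:
                    problems.append(f"{name}: invalid JSON on line(s) {bad}")
    return problems
```

Collecting all problems into one list, rather than failing on the first, is what would let the validator print everything that is missing in a single run.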