iipc/jwarc

Leverage gzip extra field "sl" to skip over compressed WARC records

sebastian-nagel opened this issue · 0 comments

WARC writers may provide a gzip extra field "sl" (recommended by WARC 0.9 but dropped in newer versions) that encodes the length of the compressed WARC record. A reader can use it to skip quickly over the current record in tasks (e.g. CDX indexing) that do not need to read the payload. See also #14/#15.
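To illustrate the idea, here is a rough sketch of locating the "sl" subfield in a gzip member header. The header walk follows RFC 1952 (FEXTRA flag, XLEN, then SI1/SI2/LEN subfields). The class and method names (`GzipSkipLength`, `findSubfield`, `compressedLength`) are hypothetical and not part of jwarc, and the decoding of the payload as a 4-byte little-endian compressed length is an assumption that should be checked against the WARC 0.9 draft and real writer output.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of jwarc.
final class GzipSkipLength {
    private static final int FEXTRA = 0x04; // FLG bit 2 (RFC 1952)

    /** Returns the data of extra subfield (si1, si2), or null if absent or malformed. */
    static byte[] findSubfield(byte[] header, char si1, char si2) {
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.remaining() < 12) return null;           // fixed header (10) + XLEN (2)
        if ((buf.get() & 0xff) != 0x1f || (buf.get() & 0xff) != 0x8b) return null;
        buf.get();                                       // CM (compression method)
        int flags = buf.get() & 0xff;                    // FLG
        if ((flags & FEXTRA) == 0) return null;          // no extra field present
        buf.position(10);                                // skip MTIME (4), XFL, OS
        int xlen = buf.getShort() & 0xffff;
        if (buf.remaining() < xlen) return null;         // caller supplied too few bytes
        int end = buf.position() + xlen;
        while (end - buf.position() >= 4) {
            char id1 = (char) (buf.get() & 0xff);
            char id2 = (char) (buf.get() & 0xff);
            int len = buf.getShort() & 0xffff;
            if (len > end - buf.position()) return null; // subfield overruns XLEN
            if (id1 == si1 && id2 == si2) {
                byte[] data = new byte[len];
                buf.get(data);
                return data;
            }
            buf.position(buf.position() + len);          // skip unrelated subfield
        }
        return null;
    }

    /**
     * ASSUMPTION: the first four bytes of the "sl" payload hold the compressed
     * member length as an unsigned little-endian integer. Returns -1 if unavailable.
     */
    static long compressedLength(byte[] sl) {
        if (sl == null || sl.length < 4) return -1;
        return ByteBuffer.wrap(sl).order(ByteOrder.LITTLE_ENDIAN).getInt() & 0xffffffffL;
    }

    public static void main(String[] args) {
        // Hand-built header: FEXTRA set, one "sl" subfield with 8 bytes of data.
        byte[] header = {
            0x1f, (byte) 0x8b, 8, 4, 0, 0, 0, 0, 0, 3, // ID1 ID2 CM FLG MTIME XFL OS
            12, 0,                                     // XLEN
            's', 'l', 8, 0,                            // SI1 SI2 LEN
            0x2a, 0x01, 0, 0, 0x00, 0x04, 0, 0         // payload
        };
        System.out.println(compressedLength(findSubfield(header, 's', 'l')));
    }
}
```

If the decoded value is positive, a reader that only needs the WARC headers could seek the underlying channel forward to the start of the next gzip member instead of inflating the payload. When the field is absent (the helper returns `null` or `-1`), it would fall back to the normal read path. Which offset the length is measured from is also something to verify before relying on it.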