internetarchive/warctools

G-Zip Content-Length


Warctools uses the Content-Length field to determine the length of the body when validating and reading WARC files. In Common Crawl WARC files, bodies that were originally gzipped are stored decompressed, so the Content-Length value no longer matches the stored body and gzipped messages are only partially parsed.
#14 fixes this and allows Common Crawl WARC files to be parsed properly.
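
The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not warctools code: `read_body_by_content_length` is a hypothetical reader that trusts Content-Length, and the sample assumes the archive keeps the original header value while storing the decompressed body.

```python
import gzip


def read_body_by_content_length(headers, stored_body):
    """Hypothetical reader that trusts Content-Length (not the warctools API)."""
    length = int(headers["Content-Length"])
    return stored_body[:length]


# A repetitive payload, so the gzipped form is much shorter than the original.
original = b"<html>" + b"hello world " * 50 + b"</html>"
compressed = gzip.compress(original)

# Headers as the server sent them: the length describes the gzipped bytes.
headers = {"Content-Encoding": "gzip", "Content-Length": str(len(compressed))}

# Assumption: the archive stores the decompressed body with the headers unchanged.
stored_body = original

parsed = read_body_by_content_length(headers, stored_body)
truncated = len(parsed) < len(stored_body)
print(f"stored {len(stored_body)} bytes, parsed {len(parsed)} bytes, truncated={truncated}")
```

Under these assumptions the reader stops after `len(compressed)` bytes and the rest of the stored body is never parsed.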