iipc/warc-specifications

Content-Length-GZIP header (More GZIP tricks)

bsdphk opened this issue · 32 comments

The 'trick' in Annex D.2 can be taken one step further:

Compress the header and body separately and add a "Content-Length-GZIP" header which records the length in bytes of the gzip'ed body.

This makes it possible to read all the headers of a WARC file, skipping over the bodies, without using an external index ... which is particularly convenient when you are (re)building an external index.

I implement this in my "AardWARC" software (https://github.com/bsdphk/AardWARC) and I can recommend this "trick" for inclusion in a future revision of WARC.
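
To make the indexing idea concrete, here is a rough, hypothetical Python sketch (not AardWARC's actual code). It assumes the per-record layout described further down in the thread, gzip(headers + CRLF) gzip(block) gzip(CRLF CRLF), with a constant 24-byte compressed trailer, and a Content-Length-GZIP header recording the compressed block size:

  import zlib

  TRAILER_GZ_LEN = 24   # assumption: the constant gzip'ed CRLF CRLF member is 24 bytes

  def read_gzip_member(fh):
      """Decompress one gzip member at the current offset and seek just past it."""
      start = fh.tell()
      d = zlib.decompressobj(wbits=31)          # 31 = expect a gzip wrapper
      out = b""
      while not d.eof:
          chunk = fh.read(4096)
          if not chunk:
              break
          out += d.decompress(chunk)
      consumed = (fh.tell() - start) - len(d.unused_data)
      fh.seek(start + consumed)
      return out

  def iter_headers(fh):
      """Yield (offset, header dict) per record without decompressing any block."""
      while True:
          offset = fh.tell()
          raw = read_gzip_member(fh)
          if not raw:
              return
          fields = {}
          for line in raw.decode("utf-8", "replace").splitlines():
              if ": " in line:
                  name, value = line.split(": ", 1)
                  fields[name] = value
          yield offset, fields
          # skip gzip(block) and the constant gzip(CRLF CRLF) in one relative seek
          fh.seek(int(fields["Content-Length-GZIP"]) + TRAILER_GZ_LEN, 1)

Rebuilding an index is then just a matter of recording the offset, WARC-Record-ID and WARC-Type for each yielded header block.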

ato commented

I quite like that idea. I also recall at some point hearing someone talk about doing it by stuffing a length into the 'extra' field in the gzip header.

Compressing the headers separately also raises the possibility of delivering the payload as-is, still compressed (when requested over HTTP with negotiated gzip), although the trailing CRLF is something of a headache.

It sounds like you're suggesting splitting the WARC + http headers into one gzip block and the payload into a separate gzip block? I think this might have made sense at the beginning, but at this point, it would be a breaking change to the standard. I think the standard should be more clear about the per record compression that is expected by all existing tools.

If I understand correctly, this optimization would mostly be useful if frequently reindexing the WARC so that the file could be scanned without reading the payload. But most web archiving use cases index the WARC once, and then use the index from that point on, so this optimization would be of little use for the replay use case.

If the HTTP headers are placed in a separate block from the payload, that just means replay would have to decompress two gzip members instead of one for every response. The only reason to read the payload without HTTP headers might be range query responses, but that would still be difficult unless the payload itself is uncompressed.

Also, since the indexing format is unspecified in the WARC standard, there is nothing preventing the creation of an index format that includes a pointer to the beginning of the record and a pointer to the payload, if that were necessary. When reindexing frequently, one could also use the previous index (unless it were corrupt) to get the same benefit.

Perhaps I'm misunderstanding what is being proposed, but it seems to be an incompatible change without a clear benefit to the standard.

If anything the standard should be amended to clearly specify the per-record compression -- the WARC headers + http headers + payload should be compressed in the same gzip block, as that is the expectation of most (all?) current web archiving tools.

ato commented

I should note for context that @bsdphk's AardWARC is not actually a web archiving tool and doesn't deal with HTTP headers (it only uses resource and metadata records) so the original proposal here likely only refers to the WARC headers and not HTTP headers, although an extension to payloads for response records seems a logical consideration.

there is nothing preventing the creation of an index format that includes a pointer to the beginning of the record and a pointer to the payload

There's limited benefit to doing that for a compressed record, as you can't seek within it and have to process all the data from the beginning anyway.

it seems to be an incompatible change without a clear benefit to the standard.

That's one thing I prefer about the alternative approach I mentioned of putting the compressed length into the gzip header, as it allows for the fast indexing optimisation in a perfectly backwards compatible way without any change to the member boundaries. You can decompress only as much as you care about of the WARC and HTTP headers and then skip to the next record.
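
For illustration, the reader side of that could look roughly like the sketch below. The two-byte subfield ID and the 8-byte little-endian length are assumptions for the example (they would have to be agreed and registered; compare the 'sl' field mentioned later in this thread), not an existing convention:

  import struct

  def member_length_from_extra(fh, subfield_id=b"sl"):
      """Return the compressed member length stored in FEXTRA, or None if absent."""
      start = fh.tell()
      fixed = fh.read(10)                     # ID1 ID2 CM FLG MTIME(4) XFL OS
      if len(fixed) < 10 or fixed[:2] != b"\x1f\x8b" or not (fixed[3] & 0x04):
          fh.seek(start)
          return None                         # not gzip, or FEXTRA flag not set
      xlen, = struct.unpack("<H", fh.read(2))
      extra = fh.read(xlen)
      fh.seek(start)                          # leave the caller's position untouched
      while len(extra) >= 4:
          sid, slen = extra[:2], struct.unpack("<H", extra[2:4])[0]
          if sid == subfield_id and slen == 8:
              return struct.unpack("<Q", extra[4:12])[0]
          extra = extra[4 + slen:]
      return None

An indexer that finds the subfield can seek straight to the next member; one that doesn't simply falls back to decompressing, which is what keeps the scheme backwards compatible.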

Just to clarify: I do not think my way of compressing the objects violates the standard; it is inherent in how the gzip format works.

In AardWARC I do indeed put the GZIP'ed length into extra headers, both for quick skipping, and as @ato points out, to be able to deliver the gzip'ed content directly.

(I deal with the trailing CRLF CRLF by compressing it separately too, but since this is constant and static, I simply have the 24-byte compressed sequence as a constant in my source code.)

The gzip'ed length of the content goes into a "Content-Length-GZIP: %u" header, and for segmented objects I have also added a "WARC-Segment-Total-Length-GZIP: %u" header.

I recommend both headers as optional features for a future revision of the WARC standard.

Our WARC use-case is not the "usual", but I think we are well inside the four corners of the specification, and apart from a couple of minor annoyances[1] WARC works fine for us.

The main difference is that in our case data-loss is catastrophic, we cannot bet on an earlier or later web-harvesting to have a (near-identical) copy, so integrity checks and auditing are very high priority for us.

[1] It is weirdly inconsistent that a segmented object has the WARC-Payload-Digest in the first segment, but the total length in the last. I propose that the total length also be allowed in the first segment, and that WARC-Payload-Digest can be put in the first, last or both segments.

I think @ikreymer is saying that he thinks this would be a breaking change in the WARC standard, not the GZip standard.

Currently the standards and all the other web archiving tools make certain assumptions about WARC files. Therefore, in order to be usable, we'd need some way of easily distinguishing these two approaches at parse time.

The WARC spec. does not require a warcinfo record, so we can't rely on putting a profile marker up there.

What WARC-Type are you using for these records? I don't think your approach is consistent with any of the current record type definitions, but new record types might work. e.g. instead of WARC-Type: response using WARC-Type: response-headers and WARC-Type: response-body? Using new record types also means implementations will safely ignore records they can't understand.

Aside from the standardisation, I'd be curious if you have any benchmark data that indicates the advantages of this approach?

I only use warcinfo, metadata and resource.

A single warcinfo at the front of each file and nowhere else.

I have uploaded a small test-archive here:

http://phk.freebsd.dk/misc/Aardwarc

The WARC files are under the '0' subdirectory; I would be very interested to hear whether your tools have any problems with them.

PS: If the archiving tools make assumptions about how content is gzip'ed, I would argue that should be in the standard rather than being an "ambient requirement" people "just have to know".

Yes, to clarify a bit what I was saying: most currently known WARC tools (those listed in https://github.com/iipc/awesome-web-archiving) work on the assumption that a single WARC record == one gzip block. That is, everything starting from WARC/1.0 up to (but not including) the next WARC/1.0 is one gzip block, and the next gzip block starts with the next WARC/1.0.
I think that any WARC file that does not do this will be considered invalid by most existing tools currently.

However, all that the standard says is, in an addendum:

D.2 Record-at-time compression
As specified in 2.2 of the GZIP specification (see [RFC 1952]), a valid GZIP file consists of any number of
GZIP "members", each independently compressed.
Where possible, this property should be exploited to compress each record of a WARC file
independently. This results in a valid GZIP file whose per-record subranges also stand alone as valid
GZIP files. 

Given the de facto stricter standard imposed by the tooling, this should probably be made clearer. I can see how this could be interpreted to mean multiple gzip blocks per record, when in reality it should really be one gzip block per record to work with existing tools.

I have recently seen another example of WARCs that did not follow this at a memory institution, and there was confusion as to why they did not work with any existing tools, and this is the reason (multiple gzip blocks per record).

I'd be very wary of creating more tools that do this unless absolutely necessary, as it will not work with other tools. If it is absolutely necessary, there should probably be a clear way to specify that a WARC record is split into multiple gzip blocks, and a future version of the standard should include this spec and make it clear that otherwise it is one record == one gzip member.

(@bsdphk there is currently a permission issue on downloading the WARC files, returns 403 -- will try them out when fixed)

I'm sorry, but if you read the gzip RFC, then your tools are violating the GZIP spec: The tools should keep reading and gunzip'ing until the higher level protocol (WARC in this case) is satisfied.

It would be a really bad idea to start doing heart surgery on the GZIP standard inside the WARC specification.

And for reference: I'm giving you that advice from 30+ years experience implementing badly written standards in all sorts of contexts: You have a finite amount of data from the past, but will have a much larger amount of data from the future, and you should focus on the bigger of the two and always get it right, no matter how late in the game it seems to be.

(Permissions should be fixed now)

Yes, fair point, much of our tooling is likely not complying with the GZip spec. I think the tools and the WARC standard annex can be brought in line as long as we are okay to assume the start of each WARC record will be on a gzip member boundary, e.g., indexers could ensure they emit a record when they hit member-boundary+WARC/1.0 and read across other member boundaries otherwise.
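
A loose sketch of that heuristic in Python (not how any existing indexer is actually written): decompress member by member, and only count a boundary as a record start when the member's output begins with a WARC version line.

  import zlib

  def record_offsets(path):
      """Offsets of members whose decompressed data starts with a WARC version line."""
      with open(path, "rb") as fh:
          data = fh.read()        # fine for a sketch; a real indexer would stream
      offsets, pos = [], 0
      while data:
          d = zlib.decompressobj(wbits=31)
          out = d.decompress(data)
          if out.startswith(b"WARC/"):
              offsets.append(pos)             # member boundary + WARC magic
          consumed = len(data) - len(d.unused_data)
          if consumed == 0:
              break                           # trailing garbage; stop rather than loop
          pos += consumed
          data = d.unused_data
      return offsets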

That said, most of the tools also assume you're storing HTTP transaction request/response records (as did I!), and won't be much use when using WARC to store non-web material (this is an under-developed use case -- until now I was only aware of @elzikb working with WARC like this. See also #30). I guess JWAT-Tools and warcio could be expected to support this use case?

@bsdphk I can't download those files - I just get the uncompressed first record:

$ curl http://phk.freebsd.dk/misc/Aardwarc/0/00000000.warc.gz
WARC/1.1
WARC-Record-ID: <file:///home/phk/Proj/AardWARC/tests/4951/4af2e63b972c35efb125bcb3385fb17c>
Content-Length: 256
Content-Length-GZIP: 279
Content-Type: application/warc-fields
WARC-Block-Digest: sha256:4af2e63b972c35efb125bcb3385fb17cffbe52234c33fd0b20ef79ee687eac20
WARC-Date: 2018-11-25T21:20:02Z
WARC-Filename: 00000000.warc.gz
WARC-Type: warcinfo

I was just about to add, unfortunately, gzip multi-member support is a somewhat obscure and not well supported part of the gzip spec. Many widely used tools only decompress the first member and stop there.
For example: https://bugs.chromium.org/p/chromium/issues/detail?id=20884

If a user clicks on those links in Chrome, only the first member will be downloaded and automatically decompressed, because of how Content-Encoding: gzip is handled. One has to be careful when making compressed WARC files available for download to make sure they are actually downloaded as expected.

For me, curl and wget seem to do the right thing (on OSX), but I wouldn't be surprised if different versions messed up the download

We have a fairly intrusive web proxy here unfortunately. Looks like it's interfering and messing with the download -- works fine if I download it from elsewhere. That's nasty.

I can confirm that warcio would error on trying to process these WARCs. However, there is now a recompress command that can convert these WARCs into our standard one record == one gzip member format.
As an experiment, I ran this on all the files:

warcio recompress ./orig/$i.warc.gz ./fixed/$i-fix.warc.gz

which converted all the WARCs into our canonical format.

It turns out the one member==one record format results in all the WARCs being smaller.
The original directory has a size of 752516 bytes, while the 'fixed' WARCs add up to a total of 714058 bytes.

Of course, I know this is a small, contrived example, but still a point to consider since we are talking about compression -- these non-standard WARCs are actually larger than what would be achieved using the standard practice.

File sizes of the 'fixed' WARCs:

-rw-r--r--   1 ilya  wheel  13960 Dec 12 08:31 00000000-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14855 Dec 12 08:31 00000001-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14235 Dec 12 08:31 00000002-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14370 Dec 12 08:31 00000003-fix.warc.gz
-rw-r--r--   1 ilya  wheel  13898 Dec 12 08:31 00000004-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14904 Dec 12 08:31 00000005-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14152 Dec 12 08:31 00000006-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14187 Dec 12 08:31 00000007-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14157 Dec 12 08:31 00000008-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14574 Dec 12 08:31 00000009-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14830 Dec 12 08:31 00000010-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14403 Dec 12 08:31 00000011-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14548 Dec 12 08:32 00000012-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14264 Dec 12 08:32 00000013-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14333 Dec 12 08:32 00000014-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14868 Dec 12 08:32 00000015-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14921 Dec 12 08:32 00000016-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14157 Dec 12 08:32 00000017-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14182 Dec 12 08:32 00000018-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14164 Dec 12 08:32 00000019-fix.warc.gz
-rw-r--r--   1 ilya  wheel  15043 Dec 12 08:32 00000020-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14478 Dec 12 08:32 00000021-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14494 Dec 12 08:32 00000022-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14455 Dec 12 08:32 00000023-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14301 Dec 12 08:32 00000024-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14502 Dec 12 08:32 00000025-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14424 Dec 12 08:32 00000026-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14224 Dec 12 08:32 00000027-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14747 Dec 12 08:32 00000028-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14255 Dec 12 08:32 00000029-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14638 Dec 12 08:32 00000030-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14607 Dec 12 08:32 00000031-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14553 Dec 12 08:32 00000032-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14354 Dec 12 08:32 00000033-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14635 Dec 12 08:32 00000034-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14468 Dec 12 08:32 00000035-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14189 Dec 12 08:32 00000036-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14371 Dec 12 08:32 00000037-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14470 Dec 12 08:32 00000038-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14707 Dec 12 08:32 00000039-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14675 Dec 12 08:32 00000040-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14288 Dec 12 08:32 00000041-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14584 Dec 12 08:32 00000042-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14615 Dec 12 08:32 00000043-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14533 Dec 12 08:32 00000044-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14574 Dec 12 08:32 00000045-fix.warc.gz
-rw-r--r--   1 ilya  wheel  14174 Dec 12 08:32 00000046-fix.warc.gz
-rw-r--r--   1 ilya  wheel  12598 Dec 12 08:32 00000047-fix.warc.gz
-rw-r--r--   1 ilya  wheel   9983 Dec 12 08:32 00000048-fix.warc.gz
-rw-r--r--   1 ilya  wheel  12157 Dec 12 08:32 00000049-fix.warc.gz

Hi, All. I've been following along—great discussion. Maybe this is just a matter of improving the wording?

D.2 Record-at-time compression
As specified in 2.2 of the GZIP specification (see [RFC 1952]), a valid
GZIP file consists of any number of GZIP "members", each independently
compressed. Where possible, this property should be exploited to
compress each record of a WARC file independently. This results in
a valid GZIP file whose per-record subranges also stand alone as
valid GZIP files.

To me, this has always read as "compress each record of a WARC file" and append them, which conveniently takes advantage of the feature of GZIP that supports multiple GZIP members.

I don't see how this breaks the GZIP spec. Can you point to something specific? Is there a gzip tool that cannot unpack a WARC file?

@siznax I agree with this as well, and think perhaps the wording should be improved to specify that it should be one record == one gzip member.

The initial suggestion was splitting a WARC record into two (or more?) gzip members, and @bsdphk, as I understand, was arguing that tools should support WARCs that contain records split into an arbitrary number of gzip members.

I don't think I agree with this. The WARC format can be more stringent as to how it uses GZIP.
A valid gzip-compressed WARC is also a valid multi-member GZIP, but any multi-member GZIP is not necessarily a WARC! For example, one could compress each byte of a WARC record into its own gzip member and cat them together, and that would still be a valid multi-member gzip, but it certainly is not (and, imo, should not be!) a valid WARC file!

@anjackson: sorry, that's partly my webserver being silly about .gz suffixes. I have created a 0.tar file also.

I'm still very interested to hear if any of your tools (after recompression) flags problems with the files.

@ikreymer: Yes, any kind of discontinuation of the gzip dictionary causes a loss of compression efficiency. Any and all of the schemes discussed here result in the WARC headers being compressed very poorly, because they always encounter an empty gzip dictionary and therefore cannot piggyback on previous records.

If you run the content of 00000000.warc.gz straight through gzip -9 as a single stream you get only 8787 bytes, largely because the WARC headers can then be properly compressed as well.

If you want to improve compression of the WARC headers, while still being able to start reading at each WARC record, you will have to dive into libz and create a (standardized) default dictionary which contains good initial targets (i.e. WARC and HTTP header names etc.)

On general principles I will caution strongly against doing that, but I will also be honest and tell you that you can get very significant compression improvements by doing so.

The HTTP/2 HPACK dictionary is both a good example of the kind of gains you can make and a horror example of how not to create the dictionary in practice.
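
Purely as an illustration of the mechanics (zlib exposes deflateSetDictionary as the zdict argument in Python), a toy sketch is below. The dictionary contents are made up, and note that this produces a raw deflate stream: the gzip container has no field for signalling a preset dictionary, so readers would have to know it out of band, which is exactly the standardization problem.

  import zlib

  # Toy dictionary of common WARC/HTTP header text; a real one would be
  # standardized and carefully tuned (cf. the HPACK static table).
  WARC_DICT = (b"WARC/1.1\r\nWARC-Type: response\r\nWARC-Record-ID: <urn:uuid:"
               b">\r\nWARC-Date: \r\nWARC-Payload-Digest: sha256:"
               b"\r\nWARC-Block-Digest: sha256:\r\nContent-Type: application/http"
               b"\r\nContent-Length: \r\n")

  def compress_headers(headers: bytes) -> bytes:
      c = zlib.compressobj(9, zlib.DEFLATED, -15, zdict=WARC_DICT)  # raw deflate
      return c.compress(headers) + c.flush()

  def decompress_headers(blob: bytes) -> bytes:
      d = zlib.decompressobj(-15, zdict=WARC_DICT)
      return d.decompress(blob) + d.flush()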

@siznax: I don't advocate allowing or encouraging crazy freeform gzipping, but since Annex D is only informative, a file gzip'ed byte by byte is a valid WARC file as far as the current document goes.

I don't think you should change that just because the current tools have been written a particular way.

In particular, I can easily see why people might recompress WARC files to one single gzip element for space reasons, or use one of the newer even more efficient compression algorithms, and there is always the option to "recompress" to make things compatible.

The main advantage to the (header CRLF)(block)(CRLF CRLF) scheme I use, is that you can deliver individual elements of the WARC silo over HTTP without decompressing (and optionally recompressing for Content-Encoding) the (block) part. If your silo contains gigabyte-sized images of harddisks, like ours will, that matters a LOT.

Let me finish with a little bit of context about our application of WARC:

Datamuseum.dk is a voluntary organization trying to preserve IT-history in Denmark until one of the existing museums steps up to the plate or we become one ourselves. We have about 400 members paying a small annual fee and about 20 more or less active members.

Our organization may cease to exist on a timescale as short as a few months: We have no money and our rooms are kindly lent to us by the local city council. If they need them for something else, we're done for. Therefore it is important for us that the bit-archive has a shape which allows any successor organization, now or much later, to access the content.

Many of the digital artifacts we will be storing are at great risk of losing their physical origin, disintegrating magtapes, rusting harddisks, you know the list, and therefore integrity and robustness were very high priorities for our selection of file format.

WARC was pretty much the only format I could find which had thought about integrity checksums at the record level, segmentation, metadata and so on. Being an ISO standard counted as a big plus.

Finally, that our friends at the Danish Royal Library are WARCists means that if all else fails, we can and will drop a pile of WARC-file foundlings on their doorstep, and hope they preserve them for the future. (Don't tell them I said that. :-)

I realize this is pretty far from the median WARC use, but there are a lot of computer preservation activities booting up around the world right now, and they will all face this task. I hope, as one always does, that my AardWARC software will help that process, but time will show how that goes.

@bsdphk: Given the use case of archiving large disk images: is there any reason why the images are not compressed on payload level? Just as it would be for web captures with "Content-Encoding: gzip" or binary formats with internal compression. The savings for an additional silo-level compression are usually marginal and you've already mentioned the "headache" with the trailing \r\n\r\n.

@bsdphk I'll be a bit more blunt here: I think you are creating invalid WARC files, which you are calling 'WARC' but which will be considered invalid by all existing open source WARC tools.

I'm still very interested to hear if any of your tools (after recompression) flags problems with the files.

The recompress tool that I have written is designed to 'fix' such invalid WARC files and convert them into valid ones, so yes, after the extra step of fixing them, they should be valid. But the recompression requires reading the entire file, decompressing it, and recompressing it, invalidating any existing indexes.

The '(header CRLF)(block)(CRLF CRLF)' is a custom compression format that you've chosen, which while a valid gzip, is not a valid WARC per the intent and consensus interpretation of the standard (and standard can be improved to clarify this).

To work with the files you create, WARC tool creators will have to implement your chosen compression scheme, because the WARC standard has been intended to mean one record == one gzip member. Your compression scheme is not documented anywhere (does the presence of a Content-Length-GZIP header imply the (header CRLF)(block)(CRLF CRLF) format?), and tool creators will have to figure out how to read these WARC files later.

Worse, future users will be frustrated because they encounter files that are called WARC, but cannot be read by other WARC tools that are out there, and it will not be clear why. Why not just call them 'AardWARC files' to differentiate them from standard WARCs?
I think there's no point in conforming to a standard if you choose to interpret it differently than everyone else.

The main advantage to the (header CRLF)(block)(CRLF CRLF) scheme I use, is that you can deliver individual elements of the WARC silo over HTTP without decompressing (and optionally recompressing for Content-Encoding) the (block) part. If your silo contains gigabyte-sized images of harddisks, like ours will, that matters a LOT.

Serving WARC records over HTTP is a common use for WARC files, and this is already done with the standard one-gzip-member-per-record WARCs. As @sebastian-nagel mentions, the compression can be just the payload and you can then stream the data as is, and the record headers themselves could be plaintext, creating an 'uncompressed' WARC record. Or, you could double compress them (both the payload, and the entire record) to get the additional benefits from WARC header compression, but probably not much of a difference in space for large disk images.

There are many, many petabytes of data with one-gzip-member-per-record, many accessed in this way over HTTP, and more being created.

I was not involved in the creation of the WARC standard, but it is what we have now. I think the changes you suggest are marginal improvements compared to the headache of having incompatible WARCs, or the effort to support this new compression scheme (especially when there may not be any space improvements at all).

I would really urge to reconsider creating incompatible WARCs when you could just follow the standard practice.

Well put, @ikreymer. The WARC standard was written—and carefully shepherded through the standardization process—in order to help the web archiving community: a large, global community of memory institutions, the IIPC. No specification is perfect and it is not reasonable to expect this spec to conform perfectly to its derivatives.

@bsdphk, I don't want to put you off, but please stay focused and help our community improve this standard. You can point out deficiencies, add caveats, and even bend the specification(s) to your purposes, but asking the community to retroactively conform to your use case does not seem helpful.

The community depends on the standard to achieve the mission of web archiving. It would be most helpful if you can work with community members toward that goal.

If this standard simply isn't sufficient for your purposes, then please consider creating a new standard which may depend on this one. If we can adjust this standard to help both our community and your use case, then that may be worthwhile, but it is up to the community.

@ikreymer can you recommend an IIPC meeting/forum where @bsdphk can voice his concerns and help improve our standard moving forward?

I think you are creating invalid WARC files, which you are calling 'WARC' but which will be considered invalid by all existing open source WARC tools.

I disagree. A valid WARC file is one which does not violate the specification. If existing tools don't support it, that just means they have additional restrictions, but it doesn't make it an "invalid WARC file".
As @bsdphk pointed out already, annex D is only a recommendation, i.e. files which do not follow that annex are not violating the standard. Therefore, a file which for example compresses each byte as an individual gzip member is just as valid a WARC as one which compresses the entire file as one gzip stream or one which follows the recommendation and compresses each record.
Of course, that does not mean that doing so is a good idea – in fact, I completely agree with you that it isn't – but technically, these are still valid WARCs.

My feelings on this may not be as strong as other people's. But I don't find the rationale for this unusual compression scheme all that compelling:

@bsdphk commented 25 days ago

This makes it possible to read all the headers of a WARC file, skipping over the bodies, without using an external index ... which is particularly convenient when you are (re)building an external index.

I'm pretty sure you could achieve the same thing by putting the length of the gzip record in the gzip header. (When you start writing the gzip record, include an "extra field" for the length, with enough bits for the longest record you could possibly conceive of, say 64 bits, and go back and fill it in after you finish writing the record.) This way you have a solution for your use case without invalidating the whole ecosystem of existing tools. And then we can tighten up the spec.
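
A rough sketch of that write-then-backfill approach, assuming a hypothetical two-byte subfield ID and an 8-byte length (again, not a registered gzip extension):

  import struct, zlib

  def write_member_with_length(out, payload, subfield_id=b"sl"):
      """Write one gzip member; backfill its total compressed length into FEXTRA."""
      start = out.tell()
      extra = subfield_id + struct.pack("<H", 8) + b"\0" * 8   # 8-byte placeholder
      header = (b"\x1f\x8b\x08\x04"            # magic, CM=deflate, FLG=FEXTRA
                b"\0\0\0\0"                    # MTIME
                b"\0\xff"                      # XFL, OS=unknown
                + struct.pack("<H", len(extra)) + extra)
      out.write(header)
      c = zlib.compressobj(9, zlib.DEFLATED, -15)               # raw deflate body
      out.write(c.compress(payload) + c.flush())
      out.write(struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF))
      end = out.tell()
      out.seek(start + len(header) - 8)        # back into the placeholder
      out.write(struct.pack("<Q", end - start))  # total member length, in bytes
      out.seek(end)

Ordinary gunzip skips unknown extra subfields, so such members stay readable everywhere, while an index-aware reader can pull the length back out of FEXTRA and seek straight to the next member.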

(edit) Ah just noticed @ato more or less suggested this approach right off the bat

I also recall at some point hearing someone talk about doing it by stuffing a length into the 'extra' field in the gzip header.

Oh and @bsdphk is already doing it, too!

In AardWARC I do indeed put the GZIP'ed length into extra headers, both for quick skipping, and as @ato points out, to be able to deliver the gzip'ed content directly.

Delivering gzip'd content directly is a more convincing use case. Another approach to that problem that might be less disruptive would be for WARC to allow Content-Encoding: gzip in the record header. Then you could write uncompressed warcs with the payloads compressed.

These WARCs are technically valid according to the specification, but IMO standards should reflect consensus rather than lead it ("rough consensus and running code" and all that).

While the tools could perhaps be updated to read this format, I am not in a position to judge how easy that is. Even if it's straightforward, I agree with others that it will add to confusion around an already confusing situation. Hence, I'm wary of introducing a new practice that is incompatible with the existing tooling without good reason.

However, this seems to be an intrusive optimisation in order to speed up the handling of a rare event (most of us re-index very rarely), and this thread has documented two different ways of achieving similar benefits while remaining compatible with the existing tools. Uncompressed WARCs with compressed payloads seem like a perfectly acceptable approach.

So, right now I'm leaning towards formalising the one-gzip-member-per-warc model as the 'official' compression scheme, and perhaps noting this alternative scheme as part of a separate informative section about using the WARC standard as a generic packaging format (with appropriate caveats around gzip-spec compliance, tool support, etc.)

(BTW, I can't find anything in the gzip RFC that says anything anywhere near as clear as:

The tools should keep reading and gunzip'ing until the higher level protocol (WARC in this case) is satisfied.

...but happy to be corrected if I've missed the point, or if there's a different source.)

@ikreymer I hate to be blunt, but as far as ISO 28500 goes, Annex D is only informative; you don't get to judge my WARC files invalid unless you can point to the normative text I violate.

You can say that you don't like them, that you think they should be outlawed in future versions of the standard, that you will personally keel-haul me for having committed this heresy, but you cannot call them invalid: that's simply not how standardization works.

@anjackson: If you wanted "rough consensus and running code" you should have tried to standardize under IETF instead of ISO, because that is not at all the way ISO operates.

When ISO says "normative" they bloody well mean it, and that is one very big reason why their standards, but not RFCs, get incorporated into treaties and legislation.

I think it is perfectly sensible to recommend the "canonical" compression you prefer in Annex D, but I don't think you should mandate it just because your current tools overlooked what, to be honest, a lot of people overlook in RFC1952 section 2.2:

  A gzip file consists of a series of "members" (compressed data
  sets).  The format of each member is specified in the following
  section.  The members simply appear one after another in the file,
  with no additional information before, between, or after them.

@nlevitt Yes, endless amounts of fun can be had playing with the internals of gzip files.

If we meet some day over beer I can regale you with tales of how I stitch one gzip file into the middle of another gzip file, possibly at depth, in the Varnish Cache HTTP accelerator :-)

(Or see for yourself: at line 584ff https://github.com/varnishcache/varnish-cache/blob/master/bin/varnishd/cache/cache_esi_deliver.c)

But one very major focus in my writing of AardWARC is that it should be simple and robust software, so that if need be, it can be pulled out of storage 50 or 100 years from now, and be made to work. For that reason I prefer to use only the minimal subset of the libz API I can get away with.

One advantage of putting it in the gzip header is that it does not "pollute" the WARC headers with implementation-specific information. (If the WARC file were subsequently recompressed with a better gzip implementation, the headers would need updating; that would not be the case if the length lives in the gzip (over)head.)

I will ponder that a bit.

@sebastian-nagel that's one of many interesting questions we have yet to answer about this entire archival exercise :-)

Fundamentally I think it turns into a tooling issue more than anything else: If we compress artifacts at the payload level, all tools retrieving artifacts need to be able to handle both compressed and uncompressed artifacts. (Yes, the irony of the parallelism to @ikreymer's argument above is not lost on me :-)

Less important, but also annoying is that the integrity digest in the WARC header no longer represents the "native" artifact, but the compressed version of it, which again means that duplicate detection only works if people are really careful with their gzip arguments.

So all in all I lean towards archiving in the "native" format, whatever that is for that kind of artifact, and trusting the gzip'ing of the WARC files to save disk space for us - but practical experience can easily change that guidance.

@siznax I'm not "retroactively asking you to conform to my use-case", I'm pretty sure my WARC files are standards compliant, and I don't need you to do anything for me.

Please reread my original submission: I was suggesting how you can make the standard more capable and machine-efficient in future revisions, by providing a standardized way to communicate the gzip'ed length of the contents.

If that suggestion does not muster quorum, whatever that means here, then fine: I don't need these features to be in the WARC standard in order to use them; they can be private extensions to AardWARC, because they can be 100% ignored as far as the "WARC-iness" of the silos goes.

And the WARC standard is plenty sufficient for our purpose, it is even well-suited to our purpose, because somebody was foresighted enough to include the "metadata" and "resource" record types for exactly this kind of use-case.

As for improvements and suggestions: I'll happily contribute, though I try to avoid air travel as much as I can for climate reasons - but if you have a meeting somewhere near Denmark...

One thing you may want to think about is reserving all headers starting with "WARC" for future versions of the standard. The standard sort of promises to always use that prefix, but it doesn't preclude everybody else from defining WARC- headers too, so you risk your future "WARC-frobozz" already being taken by somebody else.

Anyway, I think this issue is ripe for closing, my idea clearly does not resonate, and you should not waste more time on it.

Thanks for your consideration.

@bsdphk @JustAnotherArchivist I'll gladly concede that this type of compression is 'valid' as far as the ISO standard, due to the ambiguity and non-normative nature of the annex.

I meant 'invalid' for practical purposes, as far as all other existing tools go, and from a perspective of a user trying to read the WARC. The way I see it, if I create a file in a specific standardized format, and all other tools that claim to support said format are unable to read the file without changes, I would conclude that my file is in fact 'invalid', and would try to make changes to be more compatible.

I agree with @anjackson regarding consensus, but definitely see the point regarding ISO.

It seems like there is consensus that an improved definition of gzip compression is needed in the standard, so that's a very positive outcome of this discussion!

@sebastian-nagel that's one of many interesting questions we have yet to answer about this entire archival exercise :-)

Fundamentally I think it turns into a tooling issue more than anything else: If we compress artifacts at the payload level, all tools retrieving artifacts need to be able to handle both compressed and uncompressed artifacts. (Yes, the irony of the parallelism to @ikreymer's argument above is not lost on me :-)

Well, sort of, but this actually already happens to be in common use :)

I think what @sebastian-nagel, @nlevitt and others may be alluding to but perhaps is not immediately apparent is that:

  1. WARC records themselves can be uncompressed and concatenated together, and this is generally supported by many tools. The WARC tools that I've seen typically support both options: uncompressed WARC records concatenated together, and one gzip member per record concatenated together. But unfortunately, this common practice is not really obvious from the standard alone. The tools can parse the WARC headers and return the payload as is, leaving the user to decide what to do with headers + payload.

    In fact, if you take any of the WARC files you have, and instead of running them through warcio recompress, you run them through just zless, I think you'll get "valid" WARCs readable by existing tools. Perhaps I should have added that as well initially :)

  2. For HTTP response records, it is already common practice to decompress the payload if Content-Encoding: gzip is found in the HTTP headers, especially for HTML content, as this then allows web archive systems to 'rewrite' the payload on the fly for the purpose of web archive replay. The idea of decompressing the payload could thus easily be extended to also apply to resource / metadata records if Content-Encoding: gzip is added as a WARC header. Depending on the architecture, this may be done in a downstream web archiving specific library, or in the lower-level WARC reader itself (warcio supports payload decompression directly).

While it may seem like a special case (handling compressed and uncompressed payloads) this happens to already be a common use case!

This is probably all useful information that should make it into some future guide, if not a future revision, so that it can be useful to others trying to make sense of WARC standard and how it has been implemented and applied. Thank you @bsdphk for helping bring this discussion to light!

Though IIRC, it didn't make the ISO standard, the original WARC idea (based on a practice in pre-WARC Alexa ARC files) was to include a gzip extra field allowing skip-to-next-member: 'sl'. See the WARC-0.9 draft:

http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html#anchor25

I think some of IA's WARC-writing tools of a certain generation may still write this field.

@gojomo

That's not a bad idea, and you should certainly consider including it in a future version of the standard again, at least as informative.

Did the field ever get registered with Mark?

However: 4-byte length fields? Really? With the recommended 1GB silo size that means that the uncompressed field overflows at a 75+ compression rate, which is not at all uncommon.

@ikreymer

Can I suggest you use the word "uncommon" or "atypical" instead of "invalid"?

Yes, the compressed/uncompressed thing is a known issue, but ours has a couple of extra layers to it because of the museum environment. For one thing, the compression we receive may not be decompressible, unless, of course, you happen to have the 'STUFF' program from the CDC6500, and the CDC6500 to run it on? :-)

Fortunately my main focus, at least right now, is only on the reliable storage-engine of the bitarchive.

ato commented

@gojomo Ah! That's useful to know. I've definitely seen obscure code references to it but never an actual description.

@bsdphk At the time I think the standard silo size was 100MB and the limitation probably wasn't so glaringly obvious when they first devised it some 20 years ago with the 32-bit size field in the gzip trailer as prior art. :)

@ato: That's why I like computer archaeology as a hobby: Our oldest running computer has 1024 words of core, 40 bits each. (http://datamuseum.dk/wiki/GIER/K%C3%B8rende)

If any of you ever come by Copenhagen and want to see old computers, just drop me a note :-)

@bsdphk I see what you mean regarding custom compression formats, I think perhaps there were different ideas suggested.

I was still referring to adding gzip compression of the payload only upon storage in the silo.

Based on that, and after re-reading the original proposal, I wanted to mention the conclusion of that idea.

For each record, I think that if you replace:

gzip(header CRLF)gzip(block)gzip(CRLF CRLF)

with:

(header CRLF)gzip(block)(CRLF CRLF)

where the header and trailing CRLF CRLF are not gzipped at all, while the payload is still gzipped as before, you can get an uncompressed WARC record with a compressed payload.

I think this should be compatible with existing tools and still have the properties you are looking for.

There is then no longer a need for a Content-Length-GZIP header, as the standard Content-Length header gives you what you want. To seek to the next record after reading the header, simply seek by Content-Length + 4 (the trailing CRLF CRLF).

Since the payload is compressed, this could be identified with a Content-Encoding: gzip as has been suggested.

But, the WARC-Block-Digest would then refer to the digest of the compressed content.
If this is problematic, e.g. if there is a concern about inconsistent gzip compression leading to mismatching digests, a new header could be added.
For example, WARC-Block-Uncompressed-Digest (or something shorter) could then be used. When combined with the Content-Encoding: gzip header, this header would indicate the digest of the content after decompression.

Of course, the downside is that the WARC headers are no longer compressed at all (and neither is CRLF CRLF, so you get 20 bytes back!), but if most of the payload is multi-GB to begin with, that may not be as important a factor.
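
To make that concrete, a minimal sketch of reading such a layout (plaintext WARC headers, block stored gzip'ed, so Content-Length is the stored size; the path is hypothetical):

  def iter_records(path):
      """Plaintext WARC headers, compressed payload; skip by Content-Length + 4."""
      with open(path, "rb") as fh:
          while True:
              offset, headers = fh.tell(), {}
              line = fh.readline()
              if not line:
                  return                                 # end of file
              while line not in (b"", b"\r\n"):          # header block ends at blank line
                  name, sep, value = line.partition(b":")
                  if sep:
                      headers[name.strip().decode()] = value.strip().decode()
                  line = fh.readline()
              yield offset, headers
              fh.seek(int(headers["Content-Length"]) + 4, 1)   # block + CRLF CRLF

Whether the caller then decompresses the block or streams it out as-is (Content-Encoding: gzip) is its own decision, which is the delivery property being asked for.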

@ikreymer

I considered some variants like that originally, partly because of an annoyance having to do with reserving space for segmentation headers for the first "real" object in each silo. (Not important, just an implementation detail.)

But mixed raw/gzip formats like the one you propose make brute-force sequential searches much slower.

When the silo-files are valid gzip files, you can run zcat 0000000.warc.gz | grep -ab GODOT and translate the resulting byte offsets into WARC identifiers via the auxiliary index.

On modern hardware this is fast enough to be usable for research purposes, and you can trivially parallelize the search on multi-core systems.

Compression efficiency of the WARC headers or the CRLFCRLF is a non-issue in our case. Our smallest objects are paper-tapes ranging from 100 to approx 12000 bytes, but I expect our median object to be in hundreds of kilobytes, growing over time.

Exactly what we will do when we get to GB harddisks is an open question, but one of the reasons I picked the WARC format, is that the segmentation support keeps our options open.

@bsdphk

I see I inquired to the address in RFC1952 (gzip@prep.ai.mit.edu) about registering sl (0x73, 0x6C) in September 2006. I don't appear to have received a reply. I can't easily find any place which purports to be the up-to-date, canonical registry of gzip extra fields, so I'm unsure if the request had any effect.

I think we justified 4-byte fields with the idea that anyone trying to make a longer-than-4GB gzip-member could, and probably should, just end that member and start another. (And, such an extra restart in giant data would have negligible impact on compression rates.) You can also see in the WARC-0.9 writeup the idea that the sl isn't necessarily a hint guaranteed to find the next member, just a next member – it might skip over multiple members that were themselves concatenated, preserving that aspect of Gzip. (I think we were also floating the idea that every WARC could have as its last record some index of its earlier-record start-offsets, so many WARC consumers wouldn't have to build their own alongside-indexes before random-access. But, I can't recall to what extent that ever got pursued.)

I remember being frustrated that the closed-source C Alexa-era ARC tools assumed members mapped 1:1 with the next-format-level-up records – relying on that as an 'envelope' seemed wrong to me then, so it's also a bit discouraging that it might be a widespread practice in current WARC tools. Still, if that's so, in the interests of increasing-interoperability-over-time, some action seems justified to either tighten the spec or encourage the software, when revised, to stop relying on that.

Pssst, I sent an e-mail to the RFC people to ask about maybe restarting the gzip registry. There could be a chance of sl being a registered thing still.

Doing it on the WARC level instead of the gzip level (as originally proposed) does have another benefit: brotli content-encoding. Brotli has allowance for adding metadata blocks (way less structured than gzip), but bit-shifting over a whole stream is required to do that to an existing file, because the format is packed very, very tight.

Arbitrary compression methods in general do not have a clean encapsulation, so if we want to use anything else to shrink stuff... OH NO WAIT. THIS IS GOING TO TURN WARC INTO AN "ARCHIVE FILE".