iipc/jwarc

WarcReader/GunzipChannel to check ByteBuffer in constructor

sebastian-nagel opened this issue · 3 comments

The classes WarcReader and GunzipChannel both have a public constructor taking a ByteBuffer as argument. It's silently assumed that the ByteBuffer

  • is backed by an array
  • is already in read mode (.flip() called)
  • (WarcReader only) the buffer's byte order is big-endian - otherwise the detection of the gzip magic fails

The constructors should put the buffers into correct state (if possible) or throw immediately throw an IllegalArgumentException. Alternatively, the constructors taking "external" could be removed or made non-public.

Unit tests are partially implemented (see 00effbd). However, I would like to discuss the expected behavior first. Comments welcome!

ato commented

Ah, yeah. The buffer handling is one aspect of the design I've never been totally confident of. What I was originally trying to aim for was minimizing the amount of buffers and data copies. Passing in a (potentially pre-populated) buffer was meant as an alternative to using some sort of PushbackInputStream. Just using it internally and not exposing it would indeed be one option or perhaps just documenting it as some sort of 'advanced' API that is only meant to be used in special cases.

is backed by an array

For Java 11+ I was hoping to eliminate this requirement as there's now an Inflater.inflate(ByteBuffer). But yes I guess it makes sense to check this early and throw an exception so it's clearer which buffer is the one to blame.

is already in read mode (.flip() called)

Is it actually possible to detect that?

the buffer's byte order is big-endian - otherwise the detection of the gzip magic fails

Ah, of course. It hadn't occurred to me at all that someone would pass in a little-endian buffer. I guess it makes sense to set it. Or actually.. maybe we should save the current endianess and set it back after reading the gzip header? That way if the caller plans to reuse the buffer afterwards for something else they aren't surprised its been changed on them.

Passing in a (potentially pre-populated) buffer was meant as an alternative to using some sort of PushbackInputStream.

Indeed, that could be the base for a couple of use cases when records can be selected by looking just into the header without actually parsing the entire record. I'll add a unit test for a pre-populated buffer.

Inflater.inflate(ByteBuffer)

Same problem, throws an exception if the buffer is read-only.

is already in read mode (.flip() called)

Is it actually possible to detect that?

No, there is no notion of "state" in the Buffer interface, the user must take care that .flip(), .compact() etc. are called appropriately. The problem with a buffer in "write mode" is that buffer get methods will read the data between position and limit (uninitialized, garbage or zeros). Calling .flip() twice will set position and limit to 0 which discards all buffered data. Hence, this would violate the use case of pre-populated buffers. In short, the user must call .flip() before passing the buffer to the constructor.

I'll open a PR soon to address the various cases best as possible.