ChrisRx/quickavro

INITIAL_HEADER_SIZE constant limits size of schema for Avro files

gth828r opened this issue · 3 comments

I have a schema that contains a top-level union composed of complex record types. The schema itself is over 30,000 bytes long. The INITIAL_HEADER_SIZE constant, which is currently set to 8192 bytes, prevents me from decoding Avro files, because the header that gets read in does not include all of the bytes of my schema.

If the code makes an initial assumption about the maximum likely header size, it should deal with the case where the header size is in fact larger than that assumed size.

You might be able to handle this entirely in Python. Depending on whether this is Python 2 or Python 3, you would catch the exception raised by the C API (an IOError or an OSError, respectively). If that error occurs, you could seek back to the beginning of the file and try to read the header again with a much larger guess at the maximum header size (512 kB instead of 8 kB, for example), along the lines of the sketch below.
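
A rough sketch of that retry-once approach, assuming a hypothetical `read_header(f, size)` helper standing in for whatever call the C extension actually exposes (not the real quickavro API):

```python
INITIAL_GUESS = 8 * 1024      # mirrors the current INITIAL_HEADER_SIZE of 8192 bytes
LARGER_GUESS = 512 * 1024     # fallback guess for very large schemas

def read_header_with_retry(f, read_header):
    """Try the small header size first, then retry once with a much larger guess.

    `read_header(f, size)` is a hypothetical stand-in for the C API call; it is
    assumed to raise IOError (Python 2) or OSError (Python 3) when `size` bytes
    are not enough to hold the full header.
    """
    try:
        return read_header(f, INITIAL_GUESS)
    except (IOError, OSError):
        f.seek(0)                        # rewind to the start of the file
        return read_header(f, LARGER_GUESS)
```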

I don't think I'll have time to try that and test it today, and I won't be available for a few weeks. If no one else has gotten to this by the time I am available, I can try that out and send a PR if it works.

For completeness, I should also mention a couple of other similar approaches. One would be to always make a very large guess at the header size. I assume something actually reserves that memory under the hood, so that could be quite inefficient.

Another would be to do something fully dynamic, such as doubling (or squaring) the current guess at the header size each time the read fails, stopping at some absolute maximum (a sketch follows below). That seems a bit unnecessary to me, because I think most Avro schemas are going to fall into a small set of classes, such as:

  • Small (definitely fitting in the 8 kB header)
  • Large (almost certainly fitting within 512 kB)

I don't have a good sense of how large people actually make their Avro schemas, so this may not be good enough. But given that my team's schema already seems big at tens of kB, I'm going to guess that hundreds of kB is enough for other special cases.
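
For reference, a minimal sketch of the fully dynamic variant, again using the same hypothetical `read_header` helper:

```python
def read_header_doubling(f, read_header, initial=8 * 1024, absolute_max=16 * 1024 * 1024):
    """Double the guessed header size on each failed read, up to an absolute cap.

    `read_header(f, size)` is the same hypothetical helper as above, assumed to
    raise IOError/OSError when the guess is too small.
    """
    size = initial
    while size <= absolute_max:
        try:
            return read_header(f, size)
        except (IOError, OSError):
            f.seek(0)       # rewind and retry with a larger guess
            size *= 2
    raise ValueError("Avro header appears to be larger than %d bytes" % absolute_max)
```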

Yeah, I think it is a good idea to make it dynamic and keep pulling in more bytes from the file descriptor until it either has a valid record (and therefore a valid header) or fails with EOF. I added the quickavro.ReadError exception to the C extension so that it can be easily caught and reading from the file can continue, roughly as in the sketch below. Hopefully this will be a good strategy for handling headers of arbitrary length, so this won't be an issue going forward. Definitely let me know if it doesn't work for you and I can see what's up. And thanks again for all your help!
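
As a sketch of how a caller might use that, assuming a hypothetical `parse_header(buf)` helper for the C-extension parsing call; only quickavro.ReadError itself comes from the actual change:

```python
import quickavro

def read_header_incremental(f, parse_header, chunk_size=8 * 1024):
    """Keep appending bytes from the file until the header parses or EOF is hit.

    `parse_header(buf)` is a hypothetical stand-in for the C-extension call,
    assumed to raise quickavro.ReadError while `buf` does not yet contain a
    complete header.
    """
    buf = b""
    while True:
        try:
            return parse_header(buf)
        except quickavro.ReadError:
            chunk = f.read(chunk_size)
            if not chunk:
                raise EOFError("hit end of file before reading a complete Avro header")
            buf += chunk
```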