
Unhandled sentinel value for len in compression causes invalid Array dimensions

I'm generating a bunch of Arrow files with the Apache Arrow Java implementation, and many of them are not readable by Arrow.jl (although they are readable by the Java implementation).

When following the Java decoding process in the debugger, it seems that both implementations agree up to the following line in the Java implementation:

It seems like length == -1 is some kind of sentinel value meaning no compression (maybe the compressor gave up or something?), and it does not seem to be handled in the corresponding function in Arrow.jl:


len = unsafe_load(convert(Ptr{Int64}, ptr))
ptr += 8 # skip past uncompressed length as Int64
encodedbytes = unsafe_wrap(Array, ptr, buffer.length - 8)
decodedbytes = Vector{UInt8}(undef, len)

I have verified that Arrow.jl indeed does read out len = -1 (which in turn causes an error saying invalid Array dimensions when creating the decodedbytes vector).
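
For illustration, here is a minimal sketch of how the sentinel could be handled. The Arrow IPC format allows the 8-byte length prefix to be -1 to signal that the bytes that follow are stored uncompressed. `split_compressed_buffer` is a hypothetical helper, not Arrow.jl's actual code and not necessarily how any fix is implemented; it assumes the buffer is available as a plain byte vector rather than a raw pointer.

```julia
# Hypothetical helper for illustration; not Arrow.jl's actual implementation.
# Splits an Arrow IPC body buffer into its 8-byte length prefix and payload.
function split_compressed_buffer(bytes::AbstractVector{UInt8})
    length(bytes) >= 8 ||
        throw(ArgumentError("buffer is shorter than the 8-byte length prefix"))
    # The prefix is a little-endian signed 64-bit integer.
    len = ltoh(reinterpret(Int64, bytes[1:8])[1])
    payload = bytes[9:end]
    if len == -1
        # Sentinel: the payload was written uncompressed, so use it as-is.
        return (compressed = false, len = length(payload), payload = payload)
    elseif len < 0
        throw(ArgumentError("invalid uncompressed length: $len"))
    end
    # Otherwise the payload still has to be decompressed into `len` bytes.
    return (compressed = true, len = Int(len), payload = payload)
end
```

With a check like this, a prefix of -1 never reaches `Vector{UInt8}(undef, len)`, which is where the invalid Array dimensions error comes from.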

> but they are readable by the java implementation

Are they readable by pyarrow?

They are indeed!

Can you share a small sample file? If not, can you tell us what pyarrow reports in terms of type?

I don't know what causes the writer to select the uncompressed option, and it did not happen in the simple sample files I created. I can try some more if it is important.

I don't understand what is meant by "what pyarrow reports in terms of type". If you give me the command that produces the report, I'll see if I can send it.

I managed to produce a file which triggers the problem:

julia> Arrow.Table("c:/temp\\arrowtest\\test/test.arrow")
ERROR: TaskFailedException

    nested task error: ArgumentError: invalid Array dimensions

With #436, the file loads:

julia> Arrow.Table("c:/temp\\arrowtest\\test/test.arrow") |> DataFrame
102×15 DataFrame

It also loads with pyarrow out of the box:

julia> pywith(pyarrow.ipc.open_file("c:/temp\\arrowtest\\test/test.arrow")) do reader
Python DataFrame:
[102 rows x 15 columns]