Unhandled sentinel value for len in compression causes invalid Array dimensions
DrChainsaw opened this issue · 5 comments
I'm generating a bunch of Arrow files with the Apache Java implementation, and many of them are not readable by Arrow.jl (although they are readable by the Java implementation).
When following the java decoding process in the debugger, it seems that both implementations agree up to the following line in the java implementation:
https://github.com/apache/arrow/blob/febd0ff144cfb8b2baffb1cb0be57ca40dc7cc77/java/vector/src/main/java/org/apache/arrow/vector/compression/AbstractCompressionCodec.java#L72-L75
It seems like length == -1 is some kind of sentinel value for no compression (maybe the compressor gave up or something?) which does not seem to be handled in the corresponding function in Arrow.jl:
Lines 521 to 524 in e893c32
I have verified that Arrow.jl does indeed read out len = -1, which in turn causes an error saying "invalid Array dimensions" when creating the decodedbytes vector.
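To make the suspected behaviour concrete, here is a minimal sketch of how a reader could treat the length prefix. It assumes the layout described in the Arrow IPC format for buffer-level compression: each buffer starts with an 8-byte little-endian signed length, and -1 means the bytes that follow are stored uncompressed. The name `decode_buffer` is hypothetical, and `zlib` is only a stand-in codec so the example runs with the standard library (Arrow itself uses LZ4 frame or ZSTD). This is not the Arrow.jl or Java code.

```python
import struct
import zlib

NOT_COMPRESSED = -1  # sentinel length: payload is stored as-is


def decode_buffer(raw: bytes, decompress=zlib.decompress) -> bytes:
    """Decode one body buffer: 8-byte little-endian int64 prefix + payload.

    `decompress` is a stand-in codec (assumption); swap in LZ4/ZSTD as needed.
    """
    if len(raw) == 0:
        return b""  # empty buffers carry no prefix in this sketch
    (uncompressed_len,) = struct.unpack_from("<q", raw, 0)
    payload = raw[8:]
    if uncompressed_len == NOT_COMPRESSED:
        # Sentinel: do not allocate a vector of length -1, just pass the bytes through.
        return payload
    if uncompressed_len < 0:
        raise ValueError(f"invalid uncompressed length: {uncompressed_len}")
    decoded = decompress(payload)
    if len(decoded) != uncompressed_len:
        raise ValueError("decoded size does not match the length prefix")
    return decoded
```

Without the sentinel branch, the prefix would be used directly as an allocation size, which matches the "invalid Array dimensions" failure described above.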
"but they are readable by the java implementation"
Are they readable by pyarrow as well?
They are indeed!
Can you share a small sample file? If not, can you tell us what pyarrow reports in terms of type?
I don't know what causes the writer to select the uncompressed option and it did not happen in the simple sample files I created. I can try it some more if it is important.
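One plausible explanation, offered as an assumption rather than something verified against the writer used here: a writer may fall back to storing a buffer uncompressed when compression does not make it smaller, and mark that with the -1 length prefix. The sketch below illustrates that idea. `encode_buffer` is a hypothetical name and `zlib` is a stand-in codec, not what Arrow uses.

```python
import struct
import zlib


def encode_buffer(data: bytes, compress=zlib.compress) -> bytes:
    """Prefix `data` with an 8-byte little-endian int64 length.

    If compressing makes the buffer larger, store the raw bytes and write
    -1 as the length to signal "not compressed" (assumed writer policy).
    """
    compressed = compress(data)
    if len(compressed) > len(data):
        return struct.pack("<q", -1) + data
    return struct.pack("<q", len(data)) + compressed
```

Under this policy, tiny or high-entropy buffers would take the uncompressed path, which could explain why simple sample files do not always trigger it.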
I don't understand what is meant by what pyarrow reports "in terms of type". If you give me the command that produces the report, I will see if I can send it.
I managed to produce a file which triggers the problem:
test.zip
julia> Arrow.Table("c:/temp\\arrowtest\\test/test.arrow")
ERROR: TaskFailedException
nested task error: ArgumentError: invalid Array dimensions
Stacktrace:
[1] Array
@ .\boot.jl:477 [inlined]
[2] uncompress(ptr::Ptr{UInt8}, buffer::Arrow.Flatbuf.Buffer, compression::Arrow.Flatbuf.BodyCompression)
@ Arrow \.julia\dev\Arrow\src\table.jl:529
[3] buildbitmap(batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, nodeidx::Int64, bufferidx::Int64)
@ Arrow \.julia\dev\Arrow\src\table.jl:512
[4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Int, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
@ Arrow \.julia\dev\Arrow\src\table.jl:683
[5] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
@ Arrow \.julia\dev\Arrow\src\table.jl:498
[6] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
@ Arrow \.julia\dev\Arrow\src\table.jl:474
[7] iterate
@ \.julia\packages\Arrow\rYdxZ\src\table.jl:471 [inlined]
[8] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
@ Base .\abstractarray.jl:946
[9] _collect
@ .\array.jl:713 [inlined]
[10] collect
@ .\array.jl:707 [inlined]
[11] macro expansion
@ \.julia\packages\Arrow\rYdxZ\src\table.jl:376 [inlined]
[12] (::Arrow.var"#108#114"{Bool, Channel{Any}, WorkerUtilities.OrderedSynchronizer, Dict{Int64, Arrow.DictEncoding}, Arrow.Batch, Int64})()
@ Arrow .\threadingconstructs.jl:341
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base .\task.jl:445
[2] macro expansion
@ .\task.jl:477 [inlined]
[3] Arrow.Table(blobs::Vector{Arrow.ArrowBlob}; convert::Bool)
@ Arrow \.julia\dev\Arrow\src\table.jl:321
[4] Table
@ \.julia\packages\Arrow\rYdxZ\src\table.jl:295 [inlined]
[5] #Table#98
@ \.julia\packages\Arrow\rYdxZ\src\table.jl:290 [inlined]
[6] Table
@ \.julia\packages\Arrow\rYdxZ\src\table.jl:290 [inlined]
[7] Arrow.Table(input::String)
@ Arrow \.julia\dev\Arrow\src\table.jl:290
[8] top-level scope
@ REPL[27]:1
With #436 the file loads:
julia> Arrow.Table("c:/temp\\arrowtest\\test/test.arrow") |> DataFrame
102×15 DataFrame
 Row │ isA intkey primitiveIntkey doublekey booleanKey numberkey primitiveNumberkey stringkey objectkey arrayKey NrofSamples Max Min Sum SqrSum
     │ Int32 Int32 Int32 Float64 Bool Float64 Float64 String String String Int32 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0 1 2 3.0 true 4.0 5.0 6 StringObject{string='7'} [I@4dd6fd0a 2 100.0 10.0 110.0 10100.0
   2 │ 0 10 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
   3 │ 0 11 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
   4 │ 0 12 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
   5 │ 0 13 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
   6 │ 0 14 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
   7 │ 0 15 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
   8 │ 0 16 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
   9 │ 0 17 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
  10 │ 0 18 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
  11 │ 0 19 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
  12 │ 0 20 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
  13 │ 0 21 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
  14 │ 0 22 20 30.0 false 40.0 50.0 60 StringObject{string='70'} [I@bb9e6dc 1 100.0 100.0 100.0 10000.0
It loads with pyarrow out of the box:
julia> pywith(pyarrow.ipc.open_file("c:/temp\\arrowtest\\test/test.arrow")) do reader
           reader.read_pandas()
end
Python DataFrame:
isA intkey primitiveIntkey doublekey booleanKey ... NrofSamples Max Min Sum SqrSum
0 0 1 2 3.0 True ... 2 100.0 10.0 110.0 10100.0
1 0 10 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
2 0 11 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
3 0 12 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
4 0 13 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
.. ... ... ... ... ... ... ... ... ... ... ...
97 0 106 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
98 0 107 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
99 0 108 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
100 0 109 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
101 1 10 20 30.0 False ... 1 100.0 100.0 100.0 10000.0
[102 rows x 15 columns]