apache/arrow-julia

Unhandled sentinel value for len in compression causes invalid Array dimensions

DrChainsaw opened this issue · 5 comments

I'm generating a bunch of Arrow files with the Apache Arrow Java implementation, and many of them are not readable by Arrow.jl (but they are readable by the Java implementation).

When following the Java decoding process in the debugger, it seems that both implementations agree up to the following lines in the Java implementation:
https://github.com/apache/arrow/blob/febd0ff144cfb8b2baffb1cb0be57ca40dc7cc77/java/vector/src/main/java/org/apache/arrow/vector/compression/AbstractCompressionCodec.java#L72-L75

It seems like length == -1 is some kind of sentinel value meaning "no compression" (maybe the compressor gave up or something?), and it does not seem to be handled in the corresponding function in Arrow.jl:

arrow-julia/src/table.jl

Lines 521 to 524 in e893c32

len = unsafe_load(convert(Ptr{Int64}, ptr))
ptr += 8 # skip past uncompressed length as Int64
encodedbytes = unsafe_wrap(Array, ptr, buffer.length - 8)
decodedbytes = Vector{UInt8}(undef, len)

I have verified that Arrow.jl does indeed read len = -1, which in turn causes the "invalid Array dimensions" error when the decodedbytes vector is created.
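To make the failure mode and one possible fix concrete, here is a minimal sketch in plain Julia with no Arrow.jl dependency. It assumes, as the Arrow IPC format describes, that an uncompressed length of -1 means the bytes after the 8-byte prefix are stored uncompressed. The names `decode_buffer`, `length_prefix`, and `fake_decompress` are hypothetical and only for illustration; this is not Arrow.jl's actual code or the change made in any particular PR.

```julia
# Hypothetical helper (not Arrow.jl's API): decode one body buffer whose first
# 8 bytes hold the uncompressed length as a little-endian Int64.
function decode_buffer(raw::AbstractVector{UInt8}, decompress::Function)
    length(raw) >= 8 || throw(ArgumentError("buffer too short for length prefix"))
    # Read the 8-byte prefix; ltoh converts little-endian to host byte order.
    len = ltoh(reinterpret(Int64, raw[1:8])[1])
    payload = @view raw[9:end]
    if len == -1
        # Sentinel: the payload was written uncompressed, so copy it through.
        return Vector{UInt8}(payload)
    end
    # Any other negative length is invalid; fail early with a clear message
    # instead of reaching Vector{UInt8}(undef, len).
    len >= 0 || throw(ArgumentError("invalid uncompressed length: $len"))
    decoded = decompress(payload)
    length(decoded) == len ||
        throw(ArgumentError("expected $len bytes, got $(length(decoded))"))
    return decoded
end

# Illustration only: build an 8-byte little-endian length prefix.
length_prefix(n::Integer) = collect(reinterpret(UInt8, [htol(Int64(n))]))

# Stand-in for a real codec (assumption): "decompressing" doubles the payload.
fake_decompress(bytes) = repeat(bytes, 2)

# A buffer flagged with -1 is returned unchanged.
plain = decode_buffer(vcat(length_prefix(-1), UInt8[1, 2, 3]), fake_decompress)

# A buffer with a real length goes through the codec.
packed = decode_buffer(vcat(length_prefix(4), UInt8[7, 8]), fake_decompress)
```

Without the `len == -1` branch, the equivalent of `Vector{UInt8}(undef, -1)` is what raises the `ArgumentError` reported here.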

Moelf commented

> but they are readable by the java implementation

are they readable by the pyarrow?

> are they readable by the pyarrow?

They are indeed!

Moelf commented

can you share a small sample file? if not, can you tell us what pyarrow reports in terms of type?

I don't know what causes the writer to select the uncompressed option, and it did not happen in the simple sample files I created. I can keep trying if it is important.

I don't understand what is meant by "what pyarrow report in terms of type". If you give me the command that produces the report, I'll see if I can send it.

I managed to produce a file which triggers the problem:
test.zip

julia> Arrow.Table("c:/temp\\arrowtest\\test/test.arrow")
ERROR: TaskFailedException

    nested task error: ArgumentError: invalid Array dimensions
    Stacktrace:
      [1] Array
        @ .\boot.jl:477 [inlined]
      [2] uncompress(ptr::Ptr{UInt8}, buffer::Arrow.Flatbuf.Buffer, compression::Arrow.Flatbuf.BodyCompression)
        @ Arrow \.julia\dev\Arrow\src\table.jl:529
      [3] buildbitmap(batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, nodeidx::Int64, bufferidx::Int64)
        @ Arrow \.julia\dev\Arrow\src\table.jl:512
      [4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Int, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
        @ Arrow \.julia\dev\Arrow\src\table.jl:683
      [5] build(field::Arrow.Flatbuf.Field, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
        @ Arrow \.julia\dev\Arrow\src\table.jl:498
      [6] iterate(x::Arrow.VectorIterator, ::Tuple{Int64, Int64, Int64})
        @ Arrow \.julia\dev\Arrow\src\table.jl:474
      [7] iterate
        @ \.julia\packages\Arrow\rYdxZ\src\table.jl:471 [inlined]
      [8] copyto!(dest::Vector{Any}, src::Arrow.VectorIterator)
        @ Base .\abstractarray.jl:946
      [9] _collect
        @ .\array.jl:713 [inlined]
     [10] collect
        @ .\array.jl:707 [inlined]
     [11] macro expansion
        @ \.julia\packages\Arrow\rYdxZ\src\table.jl:376 [inlined]
     [12] (::Arrow.var"#108#114"{Bool, Channel{Any}, WorkerUtilities.OrderedSynchronizer, Dict{Int64, Arrow.DictEncoding}, Arrow.Batch, Int64})()
        @ Arrow .\threadingconstructs.jl:341
Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base .\task.jl:445
 [2] macro expansion
   @ .\task.jl:477 [inlined]
 [3] Arrow.Table(blobs::Vector{Arrow.ArrowBlob}; convert::Bool)
   @ Arrow \.julia\dev\Arrow\src\table.jl:321
 [4] Table
   @ \.julia\packages\Arrow\rYdxZ\src\table.jl:295 [inlined]
 [5] #Table#98
   @ \.julia\packages\Arrow\rYdxZ\src\table.jl:290 [inlined]
 [6] Table
   @ \.julia\packages\Arrow\rYdxZ\src\table.jl:290 [inlined]
 [7] Arrow.Table(input::String)
   @ Arrow \.julia\dev\Arrow\src\table.jl:290
 [8] top-level scope
   @ REPL[27]:1

With #436

julia> Arrow.Table("c:/temp\\arrowtest\\test/test.arrow") |> DataFrame
102×15 DataFrame
 Row │ isA    intkey  primitiveIntkey  doublekey  booleanKey  numberkey  primitiveNumberkey  stringkey  objectkey                  arrayKey     NrofSamples  Max      Min      Sum      SqrSum
     │ Int32  Int32   Int32            Float64    Bool        Float64    Float64             String     String                     String       Int32        Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     0       1                2        3.0        true        4.0                 5.0  6          StringObject{string='7'}   [I@4dd6fd0a            2    100.0     10.0    110.0  10100.0
   2 │     0      10               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   3 │     0      11               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   4 │     0      12               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   5 │     0      13               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   6 │     0      14               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   7 │     0      15               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   8 │     0      16               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
   9 │     0      17               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
  10 │     0      18               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
  11 │     0      19               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
  12 │     0      20               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
  13 │     0      21               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0
  14 │     0      22               20       30.0       false       40.0                50.0  60         StringObject{string='70'}  [I@bb9e6dc             1    100.0    100.0    100.0  10000.0

The file loads with pyarrow out of the box:

julia> pywith(pyarrow.ipc.open_file("c:/temp\\arrowtest\\test/test.arrow")) do reader
       reader.read_pandas()
       end
Python DataFrame:
     isA  intkey  primitiveIntkey  doublekey  booleanKey  ...  NrofSamples    Max    Min    Sum   SqrSum
0      0       1                2        3.0        True  ...            2  100.0   10.0  110.0  10100.0
1      0      10               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
2      0      11               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
3      0      12               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
4      0      13               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
..   ...     ...              ...        ...         ...  ...          ...    ...    ...    ...      ...
97     0     106               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
98     0     107               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
99     0     108               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
100    0     109               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0
101    1      10               20       30.0       False  ...            1  100.0  100.0  100.0  10000.0

[102 rows x 15 columns]