apache/arrow-julia

`Vector{UInt8}` mis-represented when writing to disk

Moelf opened this issue · 5 comments

Moelf commented
julia> using Arrow, DataFrames

julia> df = DataFrame(; x = [[0x01, 0x02], UInt8[], [0x03]])
3×1 DataFrame
 Row │ x
     │ Array
─────┼───────────────────
   1 │ UInt8[0x01, 0x02]
   2 │ UInt8[]
   3 │ UInt8[0x03]

julia> Arrow.write("/tmp/julia.feather", df)
"/tmp/julia.feather"

instead of Vector{UInt8}, it ended up being seen as byte-string

In [1]: import pyarrow.feather

In [3]: pyarrow.feather.read_table("/tmp/julia.feather")["x"]
Out[3]:
<pyarrow.lib.ChunkedArray object at 0x7fb2994c86d0>
[
  [
    0102,
    ,
    03
  ]
]
Moelf commented

to show that pyarrow does something different and consistent:

In [8]: import pyarrow.feather, numpy as np, pandas as pd

In [9]: df = pd.DataFrame({"x": [[np.uint8(0)], [np.uint8(1), np.uint8(2)]]})

In [11]: pyarrow.feather.write_feather(df, "/tmp/pyarrow.feather", compression="uncompressed")

In [12]: pyarrow.feather.read_table("/tmp/pyarrow.feather")["x"]
Out[12]:
<pyarrow.lib.ChunkedArray object at 0x7f80e3f93ec0>
[
  [
    [
      0
    ],
    [
      1,
      2
    ]
  ]
]

read it back from Julia

julia> Arrow.Table("/tmp/pyarrow.feather").x
2-element Arrow.List{Union{Missing, Vector{Union{Missing, UInt8}}}, Int32, Arrow.Primitive{Union{Missing, UInt8}, Vector{UInt8}}}:
 Union{Missing, UInt8}[0x00]
 Union{Missing, UInt8}[0x01, 0x02]
Moelf commented

I did some digging

diff --git a/src/arraytypes/arraytypes.jl b/src/arraytypes/arraytypes.jl
index f3cee5d..a338004 100644
--- a/src/arraytypes/arraytypes.jl
+++ b/src/arraytypes/arraytypes.jl
@@ -34,7 +34,9 @@ Base.deleteat!(x::T, inds) where {T <: ArrowVector} = throw(ArgumentError("`$T`
 function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=getmetadata(x); compression::Union{Nothing, Vector{LZ4FrameCompressor}, LZ4FrameCompressor, Vector{ZstdCompressor}, ZstdCompressor}=nothing, kw...)
     @debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(values(kw))"
     @debugv 3 x
+    @show typeof(x)
     A = arrowvector(x, i, 0, 0, de, ded, meta; compression=compression, kw...)
+    @show typeof(A)
     if compression isa LZ4FrameCompressor
         A = compress(Meta.CompressionTypes.LZ4_FRAME, compression, A)
     elseif compression isa Vector{LZ4FrameCompressor}
julia> data = (; x = [[0x01, 0x02], UInt8[], [0x03]], y = [[0, 1], Int[], [2,3]])
(x = Vector{UInt8}[[0x01, 0x02], [], [0x03]], y = [[0, 1], Int64[], [2, 3]])

julia> Arrow.write("/tmp/bug411.feather", data)
typeof(x) = Vector{Vector{UInt8}}
typeof(A) = Arrow.List{Vector{UInt8}, Int32, Arrow.ToList{UInt8, false, Vector{UInt8}, Int32}}
typeof(x) = Vector{Vector{Int64}}
typeof(A) = Arrow.List{Vector{Int64}, Int32, Arrow.Primitive{Int64, Arrow.ToList{Int64, false, Vector{Int64}, Int32}}}
"/tmp/bug411.feather"

the question is why UInt8 is built ToList while Int64 is Primitive while both of them seem to be possible primitive https://arrow.apache.org/docs/python/generated/pyarrow.uint8.html#pyarrow.uint8

Moelf commented

flat = ToList(x; largelists=largelists)
offsets = Offsets(UInt8[], flat.inds)
if eltype(flat) == UInt8 # binary or utf8string
data = flat
T = origtype(flat)
else

this seems to be the reason, and one step back, ToList() converts both into flat Vector{UInt8} so it's not distinguishable if you only look at variable flat

Moelf commented

we also hit this part:

arrow-julia/src/eltypes.jl

Lines 405 to 407 in c469151

else # if Vector{UInt8}
if O == Int32
Meta.binaryStart(b)

all in all it seems like a deliberate choice which I think is wrong, given pyarrow behavior and application of Vector{UInt8} that's not byte-string

quinnj commented

I think it's a reasonable request to not treat Vector{UInt8} as the Binary arrow type and only have CodeUnits be treated that way.