`Vector{UInt8}` mis-represented when writing to disk
Moelf opened this issue · 5 comments
julia> using Arrow, DataFrames
julia> df = DataFrame(; x = [[0x01, 0x02], UInt8[], [0x03]])
3×1 DataFrame
 Row │ x
     │ Array…
─────┼───────────────────
   1 │ UInt8[0x01, 0x02]
   2 │ UInt8[]
   3 │ UInt8[0x03]
julia> Arrow.write("/tmp/julia.feather", df)
"/tmp/julia.feather"
Instead of a list of `UInt8` values (`Vector{UInt8}`), the column ends up being seen as a byte string when read from pyarrow:
In [1]: import pyarrow.feather
In [3]: pyarrow.feather.read_table("/tmp/julia.feather")["x"]
Out[3]:
<pyarrow.lib.ChunkedArray object at 0x7fb2994c86d0>
[
[
0102,
,
03
]
]
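The entries `0102`, an empty value, and `03` in the output above look like hex renderings of each row's bytes, which is what one would expect from a binary column rather than a list column. A minimal stdlib-only check of that correspondence (it does not call pyarrow, and the variable names are only illustrative):

```python
# Rows as they were written from Julia: three vectors of UInt8 values.
rows = [[0x01, 0x02], [], [0x03]]

# Interpreting each row as a byte string and rendering it as hex
# reproduces the strings seen in the pyarrow output.
hex_rows = [bytes(row).hex() for row in rows]
print(hex_rows)  # ['0102', '', '03']
```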
To show that pyarrow does something different and consistent, write a similar column from Python and read it back:
In [8]: import pyarrow.feather, numpy as np, pandas as pd
In [9]: df = pd.DataFrame({"x": [[np.uint8(0)], [np.uint8(1), np.uint8(2)]]})
In [11]: pyarrow.feather.write_feather(df, "/tmp/pyarrow.feather", compression="uncompressed")
In [12]: pyarrow.feather.read_table("/tmp/pyarrow.feather")["x"]
Out[12]:
<pyarrow.lib.ChunkedArray object at 0x7f80e3f93ec0>
[
[
[
0
],
[
1,
2
]
]
]
Reading the pyarrow-written file back from Julia gives a list of `UInt8`, not a byte string:
julia> Arrow.Table("/tmp/pyarrow.feather").x
2-element Arrow.List{Union{Missing, Vector{Union{Missing, UInt8}}}, Int32, Arrow.Primitive{Union{Missing, UInt8}, Vector{UInt8}}}:
Union{Missing, UInt8}[0x00]
Union{Missing, UInt8}[0x01, 0x02]
I did some digging by adding two `@show` calls to `toarrowvector`:
diff --git a/src/arraytypes/arraytypes.jl b/src/arraytypes/arraytypes.jl
index f3cee5d..a338004 100644
--- a/src/arraytypes/arraytypes.jl
+++ b/src/arraytypes/arraytypes.jl
@@ -34,7 +34,9 @@ Base.deleteat!(x::T, inds) where {T <: ArrowVector} = throw(ArgumentError("`$T`
function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=getmetadata(x); compression::Union{Nothing, Vector{LZ4FrameCompressor}, LZ4FrameCompressor, Vector{ZstdCompressor}, ZstdCompressor}=nothing, kw...)
@debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(values(kw))"
@debugv 3 x
+ @show typeof(x)
A = arrowvector(x, i, 0, 0, de, ded, meta; compression=compression, kw...)
+ @show typeof(A)
if compression isa LZ4FrameCompressor
A = compress(Meta.CompressionTypes.LZ4_FRAME, compression, A)
elseif compression isa Vector{LZ4FrameCompressor}
julia> data = (; x = [[0x01, 0x02], UInt8[], [0x03]], y = [[0, 1], Int[], [2,3]])
(x = Vector{UInt8}[[0x01, 0x02], [], [0x03]], y = [[0, 1], Int64[], [2, 3]])
julia> Arrow.write("/tmp/bug411.feather", data)
typeof(x) = Vector{Vector{UInt8}}
typeof(A) = Arrow.List{Vector{UInt8}, Int32, Arrow.ToList{UInt8, false, Vector{UInt8}, Int32}}
typeof(x) = Vector{Vector{Int64}}
typeof(A) = Arrow.List{Vector{Int64}, Int32, Arrow.Primitive{Int64, Arrow.ToList{Int64, false, Vector{Int64}, Int32}}}
"/tmp/bug411.feather"
The question is why the `UInt8` child is built as a bare `ToList` while the `Int64` child is wrapped in `Primitive`, when both of them seem to be valid primitive types: https://arrow.apache.org/docs/python/generated/pyarrow.uint8.html#pyarrow.uint8
arrow-julia/src/arraytypes/list.jl, lines 192 to 197 in c469151
This seems to be the reason. One step back, `ToList()` converts both into a flat `Vector{UInt8}`, so they are not distinguishable if you only look at the variable `flat`.
We also hit this part:
Lines 405 to 407 in c469151
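The point about the flat buffer can be sketched outside Arrow.jl. The snippet below is only an illustration in plain Python, not the library's code: `flatten` is a hypothetical helper that mimics the offsets-plus-flat-bytes layout, and it shows that a column of byte strings and a column of `UInt8` lists with the same contents flatten to identical data, so something other than the flat buffer has to record which one was meant.

```python
from typing import List, Sequence, Tuple


def flatten(items: Sequence[Sequence[int]]) -> Tuple[List[int], bytes]:
    """Hypothetical helper: flatten variable-length items into (offsets, flat bytes)."""
    offsets = [0]
    flat = bytearray()
    for item in items:
        flat.extend(item)          # every element must fit in one byte
        offsets.append(len(flat))  # end position of this item in the flat buffer
    return offsets, bytes(flat)


# Same contents, two different intentions.
as_binary = [b"\x01\x02", b"", b"\x03"]   # byte strings
as_uint8_lists = [[1, 2], [], [3]]        # lists of uint8 values

binary_offsets, binary_flat = flatten(as_binary)
list_offsets, list_flat = flatten(as_uint8_lists)

# After flattening, nothing in the buffers distinguishes the two columns.
same_layout = (binary_offsets, binary_flat) == (list_offsets, list_flat)
print(binary_offsets, binary_flat.hex(), same_layout)  # [0, 2, 2, 3] 010203 True
```

If the writer decides the Arrow type from the flattened data alone, both columns would come out the same way, which matches what is observed above.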
All in all, it seems like a deliberate choice, which I think is wrong given pyarrow's behavior and the uses of `Vector{UInt8}` that are not byte strings.
I think it's a reasonable request to not treat `Vector{UInt8}` as the Arrow `Binary` type and to have only `CodeUnits` be treated that way.
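As a rough analogy for the behavior being requested, here is a small Python sketch in which only a dedicated bytes-like type maps to binary and a plain list of small integers maps to a list of `uint8`. The function `arrow_kind` and the type-name strings are hypothetical and exist only for this illustration. A real writer would decide from the column's element type rather than by inspecting values.

```python
from typing import List, Union


def arrow_kind(value: Union[bytes, bytearray, List[int]]) -> str:
    """Hypothetical helper: choose a logical type name for one cell."""
    # Only explicit bytes-like values are treated as binary
    # (the analogue of CodeUnits in the request above).
    if isinstance(value, (bytes, bytearray)):
        return "binary"
    # A plain list of integers in 0..255 stays a list of uint8
    # (the analogue of Vector{UInt8}).
    if isinstance(value, list) and all(
        isinstance(v, int) and 0 <= v <= 255 for v in value
    ):
        return "list<uint8>"
    raise TypeError(f"unsupported cell type: {type(value).__name__}")


kinds = [arrow_kind(b"\x01\x02"), arrow_kind([1, 2]), arrow_kind([])]
print(kinds)  # ['binary', 'list<uint8>', 'list<uint8>']
```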