apache/arrow-julia

Cannot append DictEncode columns to Stream

On Arrow v2.7.1 (and Julia v1.10.1):

julia> t = (; a = Arrow.DictEncode([:a]))
(a = [:a],)

julia> Arrow.write("x.stream.arrow", t, file = false)
"x.stream.arrow"

julia> Arrow.append("x.stream.arrow", t)
ERROR: ArgumentError: Table schema does not match existing arrow file schema
Stacktrace: [...]

The problem is that the NamedTuple's schema has the column type

julia> Tables.schema(t).types
(Arrow.DictEncodeType{Symbol},)

while the schema read back from the existing stream has

julia> s = open("x.stream.arrow", "r+") do io
    Arrow.stream_properties(io)
end;
julia> s[2].types
(Symbol,)

There doesn't seem to be an easy workaround: append doesn't allow overriding the arrow_schema, so a user would effectively have to duplicate the code of the other append methods. Omitting the Arrow.DictEncode on subsequent segments doesn't work either. The append itself succeeds, but the resulting file reports the wrong row count and the column cannot be displayed:

julia> t2 = (; a = [:b])
(a = [:b],)

julia> Arrow.append("x.stream.arrow", t2)
"x.stream.arrow"

julia> d = Arrow.Table("x.stream.arrow")
Arrow.Table with 9 rows, 1 columns, and schema:
 :a  Symbol

julia> d.a
9-element SentinelArrays.ChainedVector{Symbol, Arrow.DictEncoded{Symbol, Int8, Arrow.List{Symbol, Int32, Vector{UInt8}}}}:
Error showing value of type SentinelArrays.ChainedVector{Symbol, Arrow.DictEncoded{Symbol, Int8, Arrow.List{Symbol, Int32, Vector{UInt8}}}}:
ERROR: ArgumentError: Symbol name may not contain \0

I am unsure whether changing is_equivalent_schema would fix the issue, because I don't know whether the downstream code (toarrowtable?) can handle schemas that differ like this.
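
To make the idea concrete, here is a minimal sketch of the kind of comparison I have in mind, written as a user-side check rather than a patch. The names unwrap_dictencode and schemas_match_ignoring_dictencode are hypothetical and not part of Arrow.jl. The sketch assumes that treating Arrow.DictEncodeType{T} as T is the right notion of equivalence, and it says nothing about whether the writing code downstream would then do the right thing.

```julia
using Arrow, Tables

# Hypothetical helper (not part of Arrow.jl): strip the DictEncodeType
# wrapper so that DictEncodeType{Symbol} compares equal to Symbol.
unwrap_dictencode(::Type{Arrow.DictEncodeType{T}}) where {T} = T
unwrap_dictencode(::Type{T}) where {T} = T

# Hypothetical check: same column names, and the same column types once
# any DictEncodeType wrapper is removed. Assumes both schemas have known
# types, and does not look inside Union{Missing, ...} columns.
function schemas_match_ignoring_dictencode(a::Tables.Schema, b::Tables.Schema)
    a.names == b.names || return false
    return all(unwrap_dictencode(x) == unwrap_dictencode(y)
               for (x, y) in zip(a.types, b.types))
end
```

With this, the schema of t above and a plain (Symbol,) schema with the same column name would be considered equivalent, which is the relaxation I would expect is_equivalent_schema to need.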

Please advise.