Cannot append DictEncode columns to Stream
lmshk opened this issue · 0 comments
On Arrow v2.7.1 (and Julia v1.10.1):
julia> t = (; a = Arrow.DictEncode([:a]))
(a = [:a],)
julia> Arrow.write("x.stream.arrow", t, file = false)
"x.stream.arrow"
julia> Arrow.append("x.stream.arrow", t)
ERROR: ArgumentError: Table schema does not match existing arrow file schema
Stacktrace: [...]
The problem is that the NamedTuple has
julia> Tables.schema(t).types
(Arrow.DictEncodeType{Symbol},)
while the stream is identified as
julia> s = open("x.stream.arrow", "r+") do io
           Arrow.stream_properties(io)
       end;
julia> s[2].types
(Symbol,)
and there doesn't seem to be an easy workaround, because append doesn't allow overriding the arrow_schema without effectively duplicating the other append methods' code on the user side. Omitting the Arrow.DictEncode on subsequent segments doesn't work either:
julia> t2 = (; a = [:b])
(a = [:b],)
julia> Arrow.append("x.stream.arrow", t2)
"x.stream.arrow"
julia> d = Arrow.Table("x.stream.arrow")
Arrow.Table with 9 rows, 1 columns, and schema:
:a Symbol
julia> d.a
9-element SentinelArrays.ChainedVector{Symbol, Arrow.DictEncoded{Symbol, Int8, Arrow.List{Symbol, Int32, Vector{UInt8}}}}:
Error showing value of type SentinelArrays.ChainedVector{Symbol, Arrow.DictEncoded{Symbol, Int8, Arrow.List{Symbol, Int32, Vector{UInt8}}}}:
ERROR: ArgumentError: Symbol name may not contain \0
I am unsure whether changing is_equivalent_schema would fix the issue, because I don't understand whether the downstream code (toarrowtable?) can handle unequal schemas like this.
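To make the idea concrete, a relaxed schema comparison might look roughly like the sketch below. The names strip_dictencode and schemas_match_ignoring_dictencode are hypothetical and not part of Arrow.jl. The sketch only shows a comparison that ignores the Arrow.DictEncodeType wrapper. It does not show how is_equivalent_schema is actually implemented, and it does not settle whether the rest of append could cope with such a schema.

```julia
using Arrow, Tables

# Hypothetical helper (not in Arrow.jl): unwrap the DictEncodeType marker
# so that DictEncodeType{Symbol} is compared as plain Symbol.
strip_dictencode(::Type{Arrow.DictEncodeType{T}}) where {T} = T
strip_dictencode(::Type{T}) where {T} = T

# Hypothetical relaxed comparison: column names must match exactly,
# column types must match after unwrapping DictEncodeType.
function schemas_match_ignoring_dictencode(a::Tables.Schema, b::Tables.Schema)
    a.names == b.names || return false
    return all(strip_dictencode(x) == strip_dictencode(y)
               for (x, y) in zip(a.types, b.types))
end

# The two schemas from this report: what the stream reports vs. what the
# NamedTuple with a DictEncode column reports.
existing = Tables.Schema((:a,), (Symbol,))
incoming = Tables.Schema((:a,), (Arrow.DictEncodeType{Symbol},))
result = schemas_match_ignoring_dictencode(existing, incoming)
```

With a check like this, the two schemas above would be treated as equivalent, which is the behaviour I would expect from append here.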
Please advise.