`eachindex()` on columns producing unexpected result
Moelf opened this issue ยท 7 comments
eachindex()
can return useful index or garbage depending on the size (more precisely, # of batches) of the Arrow file:
julia> using Arrow
julia> function gendata()
x = [rand(rand(0:10)) for _ = 1:3]
y = [randn(rand(0:10)) for _ = 1:3]
(;x, y)
end
julia> Arrow.append("./out2.feather", gendata())
"./out2.feather"
julia> tbl2 = Arrow.Table("./out2.feather")
Arrow.Table with 3 rows, 2 columns, and schema:
:x Vector{Float64} (alias for Array{Float64, 1})
:y Vector{Float64} (alias for Array{Float64, 1})
julia> tbl2.x
3-element Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}:
[0.38906822055013135]
[0.9737332599073911, 0.19757843080977833, 0.06268161159155239, 0.9716472746828138, 0.8838743377016612, 0.951662206335832, 0.5973352614611379, 0.8876145238421496, 0.5149018084376046]
[0.9363691598185577, 0.3603564376862347, 0.6425331531623392, 0.4465148469221817, 0.9705740924873874, 0.7543076133441144, 0.9591142326905828, 0.1521499948934163, 0.49812895258696377]
julia> eachindex(tbl2.x)
Base.OneTo(3)
now append more data to create multiple batches
julia> foreach(1:2) do _
Arrow.append("./out2.feather", gendata())
end
julia> tbl2 = Arrow.Table("./out2.feather")
Arrow.Table with 6 rows, 2 columns, and schema:
:x Vector{Float64} (alias for Array{Float64, 1})
:y Vector{Float64} (alias for Array{Float64, 1})
julia> eachindex(tbl2.x)
SentinelArrays.IndexIterator{Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}}(Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}[[[0.23351930418379585, 0.260224999331924, 0.999696889776187], [0.6032929047338089, 0.033296127748308146, 0.8605671394492301], []], [[], [0.5499242945328698], [0.9983428836825877, 0.4739924450948647, 0.5605478962869866, 0.07106076661873084, 0.6498518642395072, 0.548789522281429, 0.2140222718354795]]]
I understand this is because SentinelArrays, but this is really unintuitive
What do you mean by garbage exactly, the verbose type? Does it work to actually index with the result? (Not sure if this is actually a bug)
it works but users have to be extremely careful:
julia> [tbl2.x[i] for i in eachindex(tbl2.x)] == collect(tbl2.x)
true
julia> [tbl2.y[i] for i in eachindex(tbl2.y)] == collect(tbl2.y)
true
julia> [tbl2.y[i] for i in eachindex(tbl2.x)] == collect(tbl2.y)
false
julia> [tbl2.x[i] for i in eachindex(tbl2.y)] == collect(tbl2.x)
false
the last 2 lines break because SentinelArrays.IndexIterator
which only happens when the table have multiple batches.
Notice, if the user remembered to use eachindex(tbl2.x, tbl2.y)
it would normalize to OneTo(6)
and everything would be fine. But the problem is neither of the last 2 lines threw error, just silently corrupted result
Ah, yeah, that seems non-ideal, although maybe not directly a bug.
Maybe some options here are to change the return here, change that special index so it errors when used with another vector, or if neither of those can be done performantly, just try to document this behavior.
I don't mind what SentinelArrays.jl do and I get it's for performance.
But just the fact that given the same Arrow schema, the type of eachindex()
of the same column changes depending on if the file contains 1 batch or more than 1 batch seems bad enough already, and there's the potential silent incorrect result behavior.
IMO Arrow is exposing too many underlying library's array type to users directly
Hmmmm, it seems pretty bad form though to be trying to use the indices of one array to index another? Like, I get that having a better error here would be nice, but that just seems like a really big no-no from the start.
because all file format doesn't have a "vanilla" way to express "these two jagged columns have same number of elements row by row", physicists in my field are somewhat used to doing this.\
maybe we can add a parent() === ...
check in getindex()
? it should get optimized away
Alright, I think this is a pretty easy assertion: JuliaData/SentinelArrays.jl#90