apache/arrow-julia

`eachindex()` on columns producing unexpected result

Moelf opened this issue ยท 7 comments

Moelf commented

eachindex() can return useful index or garbage depending on the size (more precisely, # of batches) of the Arrow file:

julia> using Arrow

julia> function gendata()
           x = [rand(rand(0:10)) for _ = 1:3]
           y = [randn(rand(0:10)) for _ = 1:3]
           (;x, y)
       end

julia> Arrow.append("./out2.feather", gendata())
"./out2.feather"

julia> tbl2 = Arrow.Table("./out2.feather")
Arrow.Table with 3 rows, 2 columns, and schema:
 :x  Vector{Float64} (alias for Array{Float64, 1})
 :y  Vector{Float64} (alias for Array{Float64, 1})

julia> tbl2.x
3-element Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}:
 [0.38906822055013135]
 [0.9737332599073911, 0.19757843080977833, 0.06268161159155239, 0.9716472746828138, 0.8838743377016612, 0.951662206335832, 0.5973352614611379, 0.8876145238421496, 0.5149018084376046]
 [0.9363691598185577, 0.3603564376862347, 0.6425331531623392, 0.4465148469221817, 0.9705740924873874, 0.7543076133441144, 0.9591142326905828, 0.1521499948934163, 0.49812895258696377]

julia> eachindex(tbl2.x)
Base.OneTo(3)

now append more data to create multiple batches

julia> foreach(1:2) do _
           Arrow.append("./out2.feather", gendata())
       end

julia> tbl2 = Arrow.Table("./out2.feather")
Arrow.Table with 6 rows, 2 columns, and schema:
 :x  Vector{Float64} (alias for Array{Float64, 1})
 :y  Vector{Float64} (alias for Array{Float64, 1})

julia> eachindex(tbl2.x)
SentinelArrays.IndexIterator{Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}}(Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}[[[0.23351930418379585, 0.260224999331924, 0.999696889776187], [0.6032929047338089, 0.033296127748308146, 0.8605671394492301], []], [[], [0.5499242945328698], [0.9983428836825877, 0.4739924450948647, 0.5605478962869866, 0.07106076661873084, 0.6498518642395072, 0.548789522281429, 0.2140222718354795]]]

I understand this is because SentinelArrays, but this is really unintuitive

What do you mean by garbage exactly, the verbose type? Does it work to actually index with the result? (Not sure if this is actually a bug)

Moelf commented

it works but users have to be extremely careful:

julia> [tbl2.x[i] for i in eachindex(tbl2.x)] == collect(tbl2.x)
true

julia> [tbl2.y[i] for i in eachindex(tbl2.y)] == collect(tbl2.y)
true

julia> [tbl2.y[i] for i in eachindex(tbl2.x)] == collect(tbl2.y)
false
julia> [tbl2.x[i] for i in eachindex(tbl2.y)] == collect(tbl2.x)
false

the last 2 lines break because SentinelArrays.IndexIterator which only happens when the table have multiple batches.

Notice, if the user remembered to use eachindex(tbl2.x, tbl2.y) it would normalize to OneTo(6) and everything would be fine. But the problem is neither of the last 2 lines threw error, just silently corrupted result

Ah, yeah, that seems non-ideal, although maybe not directly a bug.

Maybe some options here are to change the return here, change that special index so it errors when used with another vector, or if neither of those can be done performantly, just try to document this behavior.

Moelf commented

I don't mind what SentinelArrays.jl do and I get it's for performance.

But just the fact that given the same Arrow schema, the type of eachindex() of the same column changes depending on if the file contains 1 batch or more than 1 batch seems bad enough already, and there's the potential silent incorrect result behavior.

IMO Arrow is exposing too many underlying library's array type to users directly

quinnj commented

Hmmmm, it seems pretty bad form though to be trying to use the indices of one array to index another? Like, I get that having a better error here would be nice, but that just seems like a really big no-no from the start.

Moelf commented

because all file format doesn't have a "vanilla" way to express "these two jagged columns have same number of elements row by row", physicists in my field are somewhat used to doing this.\

maybe we can add a parent() === ... check in getindex()? it should get optimized away

quinnj commented

Alright, I think this is a pretty easy assertion: JuliaData/SentinelArrays.jl#90