`eachindex()` on columns producing unexpected result

Question

`eachindex()` on columns producing unexpected result

Moelf opened this issue 2 years ago · 7 comments

eachindex() can return useful index or garbage depending on the size (more precisely, # of batches) of the Arrow file:

julia> using Arrow

julia> function gendata()
           x = [rand(rand(0:10)) for _ = 1:3]
           y = [randn(rand(0:10)) for _ = 1:3]
           (;x, y)
       end

julia> Arrow.append("./out2.feather", gendata())
"./out2.feather"

julia> tbl2 = Arrow.Table("./out2.feather")
Arrow.Table with 3 rows, 2 columns, and schema:
 :x  Vector{Float64} (alias for Array{Float64, 1})
 :y  Vector{Float64} (alias for Array{Float64, 1})

julia> tbl2.x
3-element Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}:
 [0.38906822055013135]
 [0.9737332599073911, 0.19757843080977833, 0.06268161159155239, 0.9716472746828138, 0.8838743377016612, 0.951662206335832, 0.5973352614611379, 0.8876145238421496, 0.5149018084376046]
 [0.9363691598185577, 0.3603564376862347, 0.6425331531623392, 0.4465148469221817, 0.9705740924873874, 0.7543076133441144, 0.9591142326905828, 0.1521499948934163, 0.49812895258696377]

julia> eachindex(tbl2.x)
Base.OneTo(3)

now append more data to create multiple batches

julia> foreach(1:2) do _
           Arrow.append("./out2.feather", gendata())
       end

julia> tbl2 = Arrow.Table("./out2.feather")
Arrow.Table with 6 rows, 2 columns, and schema:
 :x  Vector{Float64} (alias for Array{Float64, 1})
 :y  Vector{Float64} (alias for Array{Float64, 1})

julia> eachindex(tbl2.x)
SentinelArrays.IndexIterator{Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}}(Arrow.List{Vector{Float64}, Int32, Arrow.Primitive{Float64, Vector{Float64}}}[[[0.23351930418379585, 0.260224999331924, 0.999696889776187], [0.6032929047338089, 0.033296127748308146, 0.8605671394492301], []], [[], [0.5499242945328698], [0.9983428836825877, 0.4739924450948647, 0.5605478962869866, 0.07106076661873084, 0.6498518642395072, 0.548789522281429, 0.2140222718354795]]]

I understand this is because SentinelArrays, but this is really unintuitive

Answer 1 · 2023-04-09T12:02:13.000Z

What do you mean by garbage exactly, the verbose type? Does it work to actually index with the result? (Not sure if this is actually a bug)

Answer 2 · 2023-04-09T20:10:19.000Z

it works but users have to be extremely careful:

julia> [tbl2.x[i] for i in eachindex(tbl2.x)] == collect(tbl2.x)
true

julia> [tbl2.y[i] for i in eachindex(tbl2.y)] == collect(tbl2.y)
true

julia> [tbl2.y[i] for i in eachindex(tbl2.x)] == collect(tbl2.y)
false
julia> [tbl2.x[i] for i in eachindex(tbl2.y)] == collect(tbl2.x)
false

the last 2 lines break because SentinelArrays.IndexIterator which only happens when the table have multiple batches.

Notice, if the user remembered to use eachindex(tbl2.x, tbl2.y) it would normalize to OneTo(6) and everything would be fine. But the problem is neither of the last 2 lines threw error, just silently corrupted result

Answer 3 · 2023-04-09T21:34:52.000Z

Ah, yeah, that seems non-ideal, although maybe not directly a bug.

Maybe some options here are to change the return here, change that special index so it errors when used with another vector, or if neither of those can be done performantly, just try to document this behavior.

Answer 4 · 2023-04-09T21:40:11.000Z

I don't mind what SentinelArrays.jl do and I get it's for performance.

But just the fact that given the same Arrow schema, the type of eachindex() of the same column changes depending on if the file contains 1 batch or more than 1 batch seems bad enough already, and there's the potential silent incorrect result behavior.

IMO Arrow is exposing too many underlying library's array type to users directly

Answer 5 · 2023-06-06T04:42:52.000Z

Hmmmm, it seems pretty bad form though to be trying to use the indices of one array to index another? Like, I get that having a better error here would be nice, but that just seems like a really big no-no from the start.

Answer 6 · 2023-06-06T04:57:18.000Z

because all file format doesn't have a "vanilla" way to express "these two jagged columns have same number of elements row by row", physicists in my field are somewhat used to doing this.\

maybe we can add a parent() === ... check in getindex()? it should get optimized away

Answer 7 · 2023-06-06T22:54:04.000Z

Alright, I think this is a pretty easy assertion: JuliaData/SentinelArrays.jl#90