[Discussion] Need for early-returning friendly iteration interface
Moelf opened this issue · 1 comments
Moelf commented
this is an example to demonstrate what "early returning" mean following a discussion on Slack with Alexander Plavin
tl;dr
The idea is that you have N columns you filter on, however, maybe 20% of the N columns are enough to early fail 80% of the rows. We need an ergonomic and 0-overhead interface to delay the allocation (or worse, when Feather is compressed) as much as possible.
Setup
julia> using Arrow, BenchmarkTools
julia> function gendata()
x = [rand(rand(0:10)) for _ = 1:10^5]
y = [randn(rand(0:10)) for _ = 1:10^5]
(;x, y)
end
julia> foreach(1:10) do _
Arrow.append("./out.feather", gendata())
end
Benchmark
julia> const tbl = Arrow.Table("out.feather");
julia> function kernel1(xs, ys)
s1 = maximum(ys; init=0.0)
s1 < 0.95 && return false
maximum(xs; init=0.0) < 0.7 && return false
return true
end
julia> @benchmark map(kernel1, tbl.x, tbl.y)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
Range (min … max): 264.955 ms … 271.889 ms ┊ GC (min … max): 3.83% … 4.93%
Time (median): 267.430 ms ┊ GC (median): 3.82%
Time (mean ± σ): 267.704 ms ± 2.096 ms ┊ GC (mean ± σ): 4.17% ± 0.47%
▁ ▁ ██ ▁▁ ▁ ▁ █▁ ▁ ▁ ▁ ▁ ▁ ▁
█▁▁▁█▁██▁▁▁▁██▁▁▁▁▁█▁█▁▁▁▁▁██▁▁█▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁█ ▁
265 ms Histogram: frequency by time 272 ms <
Memory estimate: 192.34 MiB, allocs estimate: 2000004.
julia> function method2(xss, yss)
map(axes(yss, 1)) do i
ys = yss[i]
s1 = maximum(ys; init=0.0)
s1 < 0.95 && return false
xs = xss[i]
maximum(xs; init=0.0) < 0.7 && return false
return true
end
end
julia> @benchmark method2(tbl.x, tbl.y)
BenchmarkTools.Trial: 25 samples with 1 evaluation.
Range (min … max): 197.487 ms … 219.503 ms ┊ GC (min … max): 2.89% … 2.82%
Time (median): 200.121 ms ┊ GC (median): 2.95%
Time (mean ± σ): 202.677 ms ± 6.567 ms ┊ GC (mean ± σ): 3.21% ± 0.48%
█▁ █ ▁ ▄
██▁█▁▆▆█▆▁▁▁▁▁▁▆█▆▁▁▁▁▁▆▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▆▁▁▆ ▁
197 ms Histogram: frequency by time 220 ms <
Memory estimate: 148.03 MiB, allocs estimate: 1536913.
julia> map(kernel1, tbl.x, tbl.y) == method2(tbl.x, tbl.y)
true
Moelf commented
My understanding is maybe https://juliaarrays.github.io/StructArrays.jl/stable/#Lazy-row-iteration can be helpful
edit, it is:
julia> using Arrow, StructArrays
julia> const tbl = Arrow.Table("out.feather");
julia> const sa = @time StructArray(Arrow.Tables.columntable(tbl));
0.228839 seconds (558.01 k allocations: 35.017 MiB, 7.08% gc time, 96.61% compilation time)
julia> @time StructArray(Arrow.Tables.columntable(tbl));
0.000041 seconds (35 allocations: 1.781 KiB)
julia> function kernel3(df)
map(LazyRows(df)) do row
ys = row.y
s1 = maximum(ys; init=0.0)
s1 < 0.95 && return false
xs = row.x
maximum(xs; init=0.0) < 0.7 && return false
return true
end
end
julia> @time kernel3(sa);
0.260317 seconds (1.54 M allocations: 148.073 MiB, 4.75% gc time)