apache/arrow-julia

[Discussion] Need for early-returning friendly iteration interface

Moelf opened this issue · 1 comments

Moelf commented

this is an example to demonstrate what "early returning" mean following a discussion on Slack with Alexander Plavin

tl;dr

The idea is that you have N columns you filter on, however, maybe 20% of the N columns are enough to early fail 80% of the rows. We need an ergonomic and 0-overhead interface to delay the allocation (or worse, when Feather is compressed) as much as possible.

Setup

julia> using Arrow, BenchmarkTools

julia> function gendata()
           x = [rand(rand(0:10)) for _ = 1:10^5]
           y = [randn(rand(0:10)) for _ = 1:10^5]
           (;x, y)
       end

julia> foreach(1:10) do _
           Arrow.append("./out.feather", gendata())
       end

Benchmark

julia> const tbl = Arrow.Table("out.feather");

julia> function kernel1(xs, ys)
           s1 = maximum(ys; init=0.0)
           s1 < 0.95 && return false

           maximum(xs; init=0.0) < 0.7 && return false
           return true
       end

julia> @benchmark map(kernel1, tbl.x, tbl.y)
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min … max):  264.955 ms … 271.889 ms  ┊ GC (min … max): 3.83%4.93%
 Time  (median):     267.430 ms               ┊ GC (median):    3.82%
 Time  (mean ± σ):   267.704 ms ±   2.096 ms  ┊ GC (mean ± σ):  4.17% ± 0.47%

  ▁   ▁ ██    ▁▁     ▁ ▁     █▁  ▁  ▁     ▁         ▁       ▁ ▁
  █▁▁▁█▁██▁▁▁▁██▁▁▁▁▁█▁█▁▁▁▁▁██▁▁█▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁█▁█ ▁
  265 ms           Histogram: frequency by time          272 ms <

 Memory estimate: 192.34 MiB, allocs estimate: 2000004.

julia> function method2(xss, yss)
           map(axes(yss, 1)) do i
               ys = yss[i]
               s1 = maximum(ys; init=0.0)
               s1 < 0.95 && return false

               xs = xss[i]
               maximum(xs; init=0.0) < 0.7 && return false
               return true
           end
       end

julia> @benchmark method2(tbl.x, tbl.y)
BenchmarkTools.Trial: 25 samples with 1 evaluation.
 Range (min … max):  197.487 ms … 219.503 ms  ┊ GC (min … max): 2.89%2.82%
 Time  (median):     200.121 ms               ┊ GC (median):    2.95%
 Time  (mean ± σ):   202.677 ms ±   6.567 ms  ┊ GC (mean ± σ):  3.21% ± 0.48%

  █▁ █   ▁        ▄
  ██▁█▁▆▆█▆▁▁▁▁▁▁▆█▆▁▁▁▁▁▆▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▆▁▁▆ ▁
  197 ms           Histogram: frequency by time          220 ms <

 Memory estimate: 148.03 MiB, allocs estimate: 1536913.

julia> map(kernel1, tbl.x, tbl.y) == method2(tbl.x, tbl.y)
true
Moelf commented

My understanding is maybe https://juliaarrays.github.io/StructArrays.jl/stable/#Lazy-row-iteration can be helpful

edit, it is:

julia> using Arrow, StructArrays

julia> const tbl = Arrow.Table("out.feather");

julia> const sa = @time StructArray(Arrow.Tables.columntable(tbl));
  0.228839 seconds (558.01 k allocations: 35.017 MiB, 7.08% gc time, 96.61% compilation time)

julia> @time StructArray(Arrow.Tables.columntable(tbl));
  0.000041 seconds (35 allocations: 1.781 KiB)

julia> function kernel3(df)
           map(LazyRows(df)) do row
               ys = row.y
               s1 = maximum(ys; init=0.0)
               s1 < 0.95 && return false

               xs = row.x
               maximum(xs; init=0.0) < 0.7 && return false
               return true
           end
       end
julia> @time kernel3(sa);
  0.260317 seconds (1.54 M allocations: 148.073 MiB, 4.75% gc time)