apache/arrow-julia

`Arrow.write` performance on large DataFrame

jariji opened this issue · 3 comments

jariji commented

`Arrow.write` is taking a long time (over an hour and still going) to write a ~25M-row by ~100-column DataFrame (floats, ints, inline strings, missings) to an NVMe SSD. Is this to be expected?

I have InlineStrings #main (after JuliaStrings/InlineStrings.jl#66) installed.

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900XT 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 24 on 24 virtual cores

  [69666777] Arrow v2.6.2
  [a93c6f00] DataFrames v1.5.0
  [842dd82b] InlineStrings v1.4.0 `https://github.com/JuliaStrings/InlineStrings.jl#main`
Moelf commented

julia> 25 * 10^6 * 8/1024^3
0.1862645149230957

Roughly 0.18 GB per `Float64` column? Yeah, that's way too slow. Is there any way to generate the same schema with dummy data?

jariji commented

`Base.summarysize(df)` says 81 GB.

julia> size(df)
(23558194, 71)

julia> countmap(eltype.(eachcol(df)))
Dict{Type, Int64} with 14 entries:
  Int64                     => 7
  Union{Missing, String15}  => 4
  Union{Missing, String127} => 11
  Union{Missing, String63}  => 3
  String15                  => 5
  Union{Missing, Int64}     => 2
  String63                  => 3
  String7                   => 6
  String127                 => 2
  Missing                   => 14
  String255                 => 1
  String31                  => 6
  Union{Missing, String31}  => 5
  Union{Missing, String255} => 2

Raw disk writes take ~1 second per GB, so naively writing 81 GB shouldn't take very long either; I expect some serialization overhead, but hours is a lot.

julia> let r = rand(UInt8, 1024^3)
           @time open("data/temp", "w") do f
               write(f, r)
           end
       end;
  1.118401 seconds (2.61 k allocations: 190.905 KiB, 0.87% compilation time)
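A complementary check is to time `Arrow.write` itself on a small synthetic table, separating Arrow's encoding cost from raw disk throughput. A sketch (the sample size and path are arbitrary):

```julia
using Arrow, DataFrames

# Time Arrow serialization on ~16 MB of synthetic data. If this is far
# slower per byte than the raw write above, the cost is in encoding
# rather than the disk.
sample = DataFrame(a = rand(10^6), b = rand(Int64, 10^6))
path = tempname() * ".arrow"
@time Arrow.write(path, sample)
rm(path)
```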

jariji commented

The computer is swapping; that likely explains the slowdown.
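That fits the numbers: an 81 GB table exceeds typical RAM, so the OS pages during the write. A quick sketch for checking memory headroom before writing (the small table here is a stand-in for the real one):

```julia
using DataFrames

df = DataFrame(x = rand(10^6))   # stand-in for the real 81 GB table

# Compare the table's in-memory size against available RAM; once the
# working set exceeds free memory, Arrow.write slows to swap speed.
table_gb = Base.summarysize(df) / 1024^3
free_gb  = Sys.free_memory() / 1024^3
total_gb = Sys.total_memory() / 1024^3
println("table ≈ $(round(table_gb, digits = 2)) GB; ",
        "RAM $(round(free_gb, digits = 1)) GB free of $(round(total_gb, digits = 1)) GB")
```

If the data can be produced in chunks, passing `Tables.partitioner` over those chunks to `Arrow.write` may help, since each partition is written as its own record batch instead of holding everything in memory at once.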