`Arrow.write` performance on large DataFrame
jariji opened this issue · 3 comments
jariji commented
`Arrow.write` is taking a long time (over an hour, still going) writing a ~25M-row by ~100-column DataFrame (floats, ints, inline strings, missings) to an NVMe SSD. Is this to be expected?

I have InlineStrings `#main` (after JuliaStrings/InlineStrings.jl#66) installed.
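For context, the call being timed is essentially the following (a sketch; the output path is hypothetical and all keyword arguments are left at their defaults):

```julia
using Arrow, DataFrames

# df is the ~25M-row table described above; the path is made up for illustration.
@time Arrow.write("data/table.arrow", df)
```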
```julia
julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900XT 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 24 on 24 virtual cores
```

```
  [69666777] Arrow v2.6.2
  [a93c6f00] DataFrames v1.5.0
  [842dd82b] InlineStrings v1.4.0 `https://github.com/JuliaStrings/InlineStrings.jl#main`
```
Moelf commented
```julia
julia> 25 * 10^6 * 8 / 1024^3
0.1862645149230957
```

Roughly 0.18 GB per Float64 column? Yeah, that's way too slow. Is there any way to generate the same schema with dummy data?
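A minimal sketch of generating dummy data with a similar mix of column types (the names, column counts, and string widths here are hypothetical; the real schema appears in the next comment):

```julia
using DataFrames, InlineStrings, Random

# Build n rows mixing floats, ints, inline strings, and a missing-capable column,
# loosely matching the table described above.
function dummy_table(n)
    DataFrame(
        f = rand(n),                                          # Float64
        i = rand(Int, n),                                     # Int64
        s = [String15(randstring(rand(1:14))) for _ in 1:n],  # String15
        m = [rand() < 0.5 ? missing : String31(randstring(8)) for _ in 1:n],  # Union{Missing, String31}
    )
end

df = dummy_table(25_000_000)
```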
jariji commented
`Base.summarysize(df)` says 81 GB.
```julia
julia> size(df)
(23558194, 71)

julia> countmap(eltype.(eachcol(df)))
Dict{Type, Int64} with 14 entries:
  Int64                     => 7
  Union{Missing, String15}  => 4
  Union{Missing, String127} => 11
  Union{Missing, String63}  => 3
  String15                  => 5
  Union{Missing, Int64}     => 2
  String63                  => 3
  String7                   => 6
  String127                 => 2
  Missing                   => 14
  String255                 => 1
  String31                  => 6
  Union{Missing, String31}  => 5
  Union{Missing, String255} => 2
```
It takes ~1 second to write ~1 GB of raw bytes, so naively, writing 81 GB should take on the order of 81 seconds; I expect some overhead, but hours is a lot.
```julia
julia> let r = rand(UInt8, 1024^3)
           @time open("data/temp", "w") do f
               write(f, r)
           end
       end;
  1.118401 seconds (2.61 k allocations: 190.905 KiB, 0.87% compilation time)
```
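One hedged way to separate `Arrow.write` overhead from raw disk throughput is to time a small slice of the table and extrapolate (the slice size and path here are illustrative):

```julia
using Arrow, DataFrames

# Time ~1% of the rows; if Arrow.write scales linearly, the full table should
# take roughly 100× as long. A far worse full-table time points at memory pressure.
sub = df[1:250_000, :]
@time Arrow.write("data/sub.arrow", sub)
```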
jariji commented
The computer is swapping: the 81 GB table doesn't fit in RAM, so the write stalls on paging.
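Consistent with that diagnosis, a quick check is to compare the table's in-memory footprint against physical memory (a sketch using only Base/Sys functions):

```julia
# If summarysize approaches or exceeds free RAM, the OS pages during the write.
table_gb = Base.summarysize(df) / 1024^3
free_gb  = Sys.free_memory() / 1024^3
total_gb = Sys.total_memory() / 1024^3
@show table_gb free_gb total_gb
```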