Unexpected allocations
JoaoAparicio opened this issue · 2 comments
I've noticed that this allocates and I'm surprised.
]activate --temp
]add Arrow
using Arrow  # Arrow re-exports ArrowTypes, which the definitions below rely on
struct IntWrapper
    data::Int64
end
const INTWRAPPER_NAME = Symbol("JuliaLang.IntWrapper")
ArrowTypes.ArrowKind(::Type{IntWrapper}) = ArrowTypes.PrimitiveKind()
ArrowTypes.ArrowType(::Type{IntWrapper}) = Int64
ArrowTypes.toarrow(x::IntWrapper) = x.data
ArrowTypes.arrowname(::Type{IntWrapper}) = INTWRAPPER_NAME
ArrowTypes.JuliaType(::Val{INTWRAPPER_NAME}, ::Type{Int64}) = IntWrapper
ArrowTypes.fromarrow(::Type{IntWrapper}, x::Int64) = reinterpret(IntWrapper, x)
x = [IntWrapper(1) for _ in 1:8_000_000];
@time Arrow.write("/tmp/temp.arrow", (x=x,))
I get (after running it once to compile):
0.401526 seconds (8.00 M allocations: 184.254 MiB, 7.14% gc time)
Basically one allocation per element of the vector.
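As a quick sanity check on the "one allocation per element" reading, the figures reported by @time can be divided out. This is only arithmetic on the rounded numbers printed above, so the results are approximate; the variable names are purely illustrative.

```julia
# Figures copied from the @time output above (both are rounded by @time).
allocations = 8.00e6            # "8.00 M allocations"
total_bytes = 184.254 * 2^20    # "184.254 MiB" converted to bytes
elements    = 8_000_000         # length of the vector being written

allocs_per_element = allocations / elements     # about one allocation per element
bytes_per_alloc    = total_bytes / allocations  # roughly 24 bytes per allocation

println("allocations per element: ", allocs_per_element)
println("bytes per allocation:    ", round(bytes_per_alloc; digits=2))
```

So each element costs about one small heap allocation, which is what one would expect if every element is boxed individually somewhere on the write path; that interpretation is a guess, not something the numbers prove.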
Compare this with the cost of saving just ints without the wrapper:
x = ones(Int,8_000_000);
@time Arrow.write("/tmp/temp.arrow", (x=x,))
0.056106 seconds (140 allocations: 11.461 KiB)
Am I doing something wrong?
I've reproduced this for:
julia 1.10.0 + Arrow 2.7.0
julia 1.9.0 + Arrow 2.6.2
julia 1.8.5 + Arrow 2.6.2
julia 1.8.5 + Arrow 2.5.0
julia 1.7.3 + Arrow 2.2.0
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
  Threads: 1 on 48 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4
Manifest
st -m
Status `/tmp/jl_mwcRrJ/Manifest.toml`
[69666777] Arrow v2.7.0
[31f734f8] ArrowTypes v2.3.0
[c3b6d118] BitIntegers v0.3.1
[5ba52731] CodecLz4 v0.4.1
[6b39b394] CodecZstd v0.8.2
[34da2185] Compat v4.12.0
[f0e56b4a] ConcurrentUtilities v2.3.0
[9a962f9c] DataAPI v1.15.0
[e2d170a0] DataValueInterfaces v1.0.0
[4e289a0a] EnumX v1.0.4
[e2ba6199] ExprTools v0.1.10
[842dd82b] InlineStrings v1.4.0
[82899510] IteratorInterfaceExtensions v1.0.0
[692b3bcd] JLLWrappers v1.5.0
[e6f89c97] LoggingExtras v1.0.3
[78c3b35d] Mocking v0.7.7
[bac558e1] OrderedCollections v1.6.3
[69de0a69] Parsers v2.8.1
[2dfb63ee] PooledArrays v1.4.3
[aea7be01] PrecompileTools v1.2.0
[21216c6a] Preferences v1.4.1
[6c6a2e73] Scratch v1.2.1
[91c51154] SentinelArrays v1.4.1
[dc5dba14] TZJData v1.0.0+2023c
[3783bdb8] TableTraits v1.0.1
[bd369af6] Tables v1.11.1
[f269a46b] TimeZones v1.13.0
[3bb67fe8] TranscodingStreams v0.10.2
[5ced341a] Lz4_jll v1.9.4+0
[3161d3a3] Zstd_jll v1.5.5+0
[0dad84c5] ArgTools v1.1.1
[56f22d72] Artifacts
[2a0f44e3] Base64
[ade2ca70] Dates
[f43a241f] Downloads v1.6.0
[7b1f6079] FileWatching
[9fa8497b] Future
[b77e0a4c] InteractiveUtils
[4af54fe1] LazyArtifacts
[b27032c2] LibCURL v0.6.4
[76f85450] LibGit2
[8f399da3] Libdl
[37e2e46d] LinearAlgebra
[56ddb016] Logging
[d6f4376e] Markdown
[a63ad114] Mmap
[ca575930] NetworkOptions v1.2.0
[44cfe95a] Pkg v1.10.0
[de0858da] Printf
[3fa0cd96] REPL
[9a3f8284] Random
[ea8e919c] SHA v0.7.0
[9e88b42a] Serialization
[6462fe0b] Sockets
[fa267f1f] TOML v1.0.3
[a4e569a6] Tar v1.10.0
[cf7118a7] UUIDs
[4ec0a83e] Unicode
[e66e0078] CompilerSupportLibraries_jll v1.0.5+1
[deac9b47] LibCURL_jll v8.4.0+0
[e37daf67] LibGit2_jll v1.6.4+0
[29816b5a] LibSSH2_jll v1.11.0+1
[c8ffd9c3] MbedTLS_jll v2.28.2+1
[14a3606d] MozillaCACerts_jll v2023.1.10
[4536629a] OpenBLAS_jll v0.3.23+2
[83775a58] Zlib_jll v1.2.13+1
[8e850b90] libblastrampoline_jll v5.8.0+1
[8e850ede] nghttp2_jll v1.52.0+1
[3f19e933] p7zip_jll v17.4.0+2
One difference I've noticed between the Vector{Int64} and Vector{IntWrapper} cases shows up on entering this function:
Lines 34 to 57 in 3712291
In the first case, col is of type Vector{Int64} and matches the first if case; in the second, col is of type ArrowTypes.ToArrow{Int64,Vector{IntWrapper}} and matches the last. This allocates because
data = Vector{UInt8}(undef, sizeof(col))
won't know the size to be allocated at compile time.
However, I believe something has already gone wrong by this stage. By inserting prints I can check sizeof(col): in the first case it is the size of the whole vector, but in the second case it is just 8, which I guess is the number of bytes for a single Int64.
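The sizeof difference can be reproduced without Arrow at all. The sketch below uses a hypothetical LazyWrapper type that only mimics the general shape of a lazy wrapper around a vector; it is an assumption for illustration, not Arrow's actual ToArrow definition. It suggests another possible reading of the 8: on a 64-bit machine, sizeof of a struct holding a single reference field is the size of that reference, regardless of how long the wrapped vector is.

```julia
# Stand-in for the IntWrapper element type from the report.
struct Wrapped
    data::Int64
end

# Hypothetical lazy wrapper; NOT Arrow's ToArrow, just a type of similar shape.
struct LazyWrapper{T,A<:AbstractVector} <: AbstractVector{T}
    data::A
end
Base.size(w::LazyWrapper) = size(w.data)
Base.getindex(w::LazyWrapper{T}, i::Int) where {T} = w.data[i].data::T

raw     = ones(Int64, 1_000)
wrapped = LazyWrapper{Int64,Vector{Wrapped}}([Wrapped(1) for _ in 1:1_000])

# For a Vector, sizeof is the payload size: length * element size.
println(sizeof(raw))
# For the wrapper, sizeof is the size of the struct itself (one reference).
println(sizeof(wrapped))
# A size that does not depend on the container type.
println(length(wrapped) * sizeof(eltype(wrapped)))
```

If the function in question sizes its buffer with sizeof(col), computing length(col) * sizeof(eltype(col)) instead would give the payload size for both kinds of column; whether that is the right fix for Arrow is a separate question.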