apache/arrow-julia

Unexpected allocations

JoaoAparicio opened this issue · 2 comments

I've noticed that the following snippet allocates heavily, which surprised me.

]activate --temp
]add Arrow
using Arrow
using Arrow: ArrowTypes
struct IntWrapper
    data::Int64
end

const INTWRAPPER_NAME = Symbol("JuliaLang.IntWrapper")
ArrowTypes.ArrowKind(::Type{IntWrapper}) = ArrowTypes.PrimitiveKind()
ArrowTypes.ArrowType(::Type{IntWrapper}) = Int64
ArrowTypes.toarrow(x::IntWrapper) = x.data
ArrowTypes.arrowname(::Type{IntWrapper}) = INTWRAPPER_NAME
ArrowTypes.JuliaType(::Val{INTWRAPPER_NAME}, ::Type{Int64}) = IntWrapper
ArrowTypes.fromarrow(::Type{IntWrapper}, x::Int64) = reinterpret(IntWrapper, x)

x = [IntWrapper(1) for _ in 1:8_000_000];
@time Arrow.write("/tmp/temp.arrow", (x=x,))

I get (after running it once to compile):

 0.401526 seconds (8.00 M allocations: 184.254 MiB, 7.14% gc time)

Basically one allocation per element of the vector.

Compare this with the cost of saving just ints without the wrapper:

x = ones(Int,8_000_000);
@time Arrow.write("/tmp/temp.arrow", (x=x,))
0.056106 seconds (140 allocations: 11.461 KiB)

Am I doing something wrong?

I've reproduced this for:

julia 1.10.0 + Arrow 2.7.0
julia 1.9.0 + Arrow 2.6.2
julia 1.8.5 + Arrow 2.6.2
julia 1.8.5 + Arrow 2.5.0
julia 1.7.3 + Arrow 2.2.0
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
  Threads: 1 on 48 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Manifest

]st -m
Status `/tmp/jl_mwcRrJ/Manifest.toml`
  [69666777] Arrow v2.7.0
  [31f734f8] ArrowTypes v2.3.0
  [c3b6d118] BitIntegers v0.3.1
  [5ba52731] CodecLz4 v0.4.1
  [6b39b394] CodecZstd v0.8.2
  [34da2185] Compat v4.12.0
  [f0e56b4a] ConcurrentUtilities v2.3.0
  [9a962f9c] DataAPI v1.15.0
  [e2d170a0] DataValueInterfaces v1.0.0
  [4e289a0a] EnumX v1.0.4
  [e2ba6199] ExprTools v0.1.10
  [842dd82b] InlineStrings v1.4.0
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.5.0
  [e6f89c97] LoggingExtras v1.0.3
  [78c3b35d] Mocking v0.7.7
  [bac558e1] OrderedCollections v1.6.3
  [69de0a69] Parsers v2.8.1
  [2dfb63ee] PooledArrays v1.4.3
  [aea7be01] PrecompileTools v1.2.0
  [21216c6a] Preferences v1.4.1
  [6c6a2e73] Scratch v1.2.1
  [91c51154] SentinelArrays v1.4.1
  [dc5dba14] TZJData v1.0.0+2023c
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.11.1
  [f269a46b] TimeZones v1.13.0
  [3bb67fe8] TranscodingStreams v0.10.2
  [5ced341a] Lz4_jll v1.9.4+0
  [3161d3a3] Zstd_jll v1.5.5+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.4
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.10.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [fa267f1f] TOML v1.0.3
  [a4e569a6] Tar v1.10.0
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.0.5+1
  [deac9b47] LibCURL_jll v8.4.0+0
  [e37daf67] LibGit2_jll v1.6.4+0
  [29816b5a] LibSSH2_jll v1.11.0+1
  [c8ffd9c3] MbedTLS_jll v2.28.2+1
  [14a3606d] MozillaCACerts_jll v2023.1.10
  [4536629a] OpenBLAS_jll v0.3.23+2
  [83775a58] Zlib_jll v1.2.13+1
  [8e850b90] libblastrampoline_jll v5.8.0+1
  [8e850ede] nghttp2_jll v1.52.0+1
  [3f19e933] p7zip_jll v17.4.0+2

One difference I've noticed between the Vector{Int64} and Vector{IntWrapper} cases shows up on entering this function:

arrow-julia/src/utils.jl

Lines 34 to 57 in 3712291

function writearray(io::IO, ::Type{T}, col) where {T}
    if col isa Vector{T}
        n = Base.write(io, col)
    elseif isbitstype(T) && (
        col isa Vector{Union{T,Missing}} || col isa SentinelVector{T,T,Missing,Vector{T}}
    )
        # need to write the non-selector bytes of isbits Union Arrays
        n = Base.unsafe_write(io, pointer(col), sizeof(T) * length(col))
    elseif col isa ChainedVector
        n = 0
        for A in col.arrays
            n += writearray(io, T, A)
        end
    else
        n = 0
        data = Vector{UInt8}(undef, sizeof(col))
        buf = IOBuffer(data; write=true)
        for x in col
            n += Base.write(buf, coalesce(x, ArrowTypes.default(T)))
        end
        n = Base.write(io, take!(buf))
    end
    return n
end

In the first case, col is a Vector{Int64} and matches the first if branch; in the second, col is an ArrowTypes.ToArrow{Int64,Vector{IntWrapper}} and falls through to the final else branch. This allocates because
data = Vector{UInt8}(undef, sizeof(col))
does not know at compile time how many bytes to allocate.
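
The branch selection described here can be reproduced with Base alone. This is a minimal sketch, not the Arrow.jl implementation: `LazyInts` is a hypothetical stand-in for `ArrowTypes.ToArrow` (whose real definition may differ), `fallback_write` is a hypothetical copy of the final else branch, and `zero(T)` is substituted for `ArrowTypes.default(T)` so that no package is needed.

```julia
# Same wrapper type as in the issue.
struct IntWrapper
    data::Int64
end

# Hypothetical lazy view that unwraps on access; NOT the real ArrowTypes.ToArrow.
struct LazyInts{A<:AbstractVector{IntWrapper}} <: AbstractVector{Int64}
    data::A
end
Base.size(w::LazyInts) = size(w.data)
Base.getindex(w::LazyInts, i::Int) = w.data[i].data

# Copy of the final `else` branch of `writearray`, with `zero(T)` standing in
# for `ArrowTypes.default(T)` (an assumption made to stay within Base).
function fallback_write(io::IO, ::Type{T}, col) where {T}
    data = Vector{UInt8}(undef, sizeof(col))
    buf = IOBuffer(data; write=true)
    for x in col
        Base.write(buf, coalesce(x, zero(T)))  # one write call per element
    end
    return Base.write(io, take!(buf))          # single write of the collected bytes
end

plain = ones(Int64, 1_000)
wrapped = LazyInts([IntWrapper(i) for i in 1:1_000])

plain isa Vector{Int64}    # true: would take the first branch
wrapped isa Vector{Int64}  # false: would fall through to the last branch

io = IOBuffer()
nbytes = fallback_write(io, Int64, wrapped)
```

Wrapping the `fallback_write` call in `@time` (after a warm-up call) is one way to check whether this branch alone accounts for the per-element allocations, or whether they come from somewhere else in the write path.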

However, I believe something has already gone wrong by this stage. By inserting prints I can check sizeof(col): in the first case it is the size of the whole vector, but in the second it is just 8, which I guess is the number of bytes in a single Int64.
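
Another reading of the 8 is possible. For an `Array`, `sizeof` reports the total bytes of element data, but for an ordinary struct it reports the size of the struct itself. The sketch below uses a hypothetical `Holder` type and assumes, without having checked the ArrowTypes source, that the wrapper stores the underlying vector in a single field; under that assumption the 8 would be the size of one reference on a 64-bit machine rather than the size of one Int64.

```julia
# Hypothetical wrapper with a single field holding the underlying vector.
struct Holder{A}
    data::A
end

v = ones(Int64, 1_000)
h = Holder(v)

sizeof(v)                       # 8000: element size times length
sizeof(h)                       # size of the struct: one reference (8 on 64-bit)
sizeof(eltype(v)) * length(v)   # 8000: byte count computed without relying on sizeof(h)
```

If that assumption holds, `Vector{UInt8}(undef, sizeof(col))` would start the buffer at 8 bytes regardless of the column length, and the `IOBuffer` would have to grow as elements are written.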

It looks like @quinnj added the logic to write to a temporary vector prior to writing the vector to the IO. I don't understand why this is required. @quinnj - do you remember?

https://github.com/apache/arrow-julia/pull/57/files#diff-47c27891e951c8cd946b850dc2df31082624afdf57446c21cb6992f5f4b74aa2R47-R52