regalloc2 makes too many memory allocations
Amanieu opened this issue · 3 comments
I used heaptrack to analyze the memory allocation patterns of my compiler which uses regalloc2. I used a large benchmark (LLVM library) which has 1257955 basic blocks in 104913 functions. Out of a total of 46780667 allocations, 46779011 of them come from regalloc2 and only 1656 come from other parts of the compiler.
My compiler uses a similar structure 1 to Cranelift where a Context
structure allows memory allocations to be reused across compilation units. Specifically, this allows various Vec
s and HashMap
s to grow to a sufficient size only once while being reused many times for compiling multiple functions.
regalloc2 unfortunately doesn't follow this design:
- A fresh
Env
structure is allocated and freed for each function. No memory allocations are preserved. - Heavy use of temporary
SmallVec
s andHashSet
s. Heaptrack reports that 19865290 allocations (42% of the total) are "temporary". These are allocations which are immediately followed by a free, with no other allocations in between. - Heavy use of
BTreeMap
which needs to allocate many nodes separately and cannot reuse memory unlikeHashMap
andVec
.
Heaptrack results
Here are the major allocation hotspots shown by heaptrack:
In try_to_allocate_bundle_to_reg
:
- 16158324 temporary allocations from the
FxHashSet
. - 9677435 allocations from inserting liveranges into
PRegData::allocations
(LiveRangeSet
backed by aBTreeMap
).
In merge_bundles
:
- 4478881 allocations from pushing a
VReg
toSpillSet::vregs
(SmallVec
). - 1895443 allocations from appending
LiveRangeListEntry
s toLiveBundle::ranges
(SmallVec
).
In insert_use_into_liverange
:
- 2578957 allocations from pushing a
Use
toLiveRange::uses
(SmallVec
).
In add_liverange_to_vreg
:
- 1432853 allocations from pushing a
LiveRangeListEntry
toVRegData::ranges
(SmallVec
).
In create_pregs_and_vregs
:
- 948582 allocations from pushing to
Env::vregs
andEnv::allocs
andEnv::inst_alloc_offsets
. (Vec
)
In Env::new
:
- 1154043 allocations from initializing the various
Vec
s withVec::with_capacity
.
Performance impact
Benchmarking shows that approximately 10% of the time is spent in memory allocation functions (malloc
, realloc
, free
). Additionally, another 10% of the time is spent in memcpy
/memmove
which could be due to vector reallocation.
Compiling with multiple threads can cause additional contention on the global allocator. This was noticed on musl targets where the global allocator doesn't have thread-local caches: compiler performance on 6 threads (on a 6-core machine) was reduced to that of a single thread whereas other allocators allow almost-perfect performance scaling.
Proposed solution
I believe the best way to improve regalloc2 would be to use a design similar to Cranelift with a Context
structure that can preserve memory allocation across multiple invocations of the register allocator.
- Temporary
SmallVec
s andHashSet
s should be replaced with a single instance inContext
. - The many
BTreeMap
s should be replaced withcranelift-bforest
which maintains a pool of nodes that can be reused. - The many
SmallVec
s should be replaced withEntityList
fromcranelift-entity
and aListPool
inContext
.
Since this would require a dependency on the cranelift-entity
and cranelift-bforest
crates, this is also a good opportunity to resolve #62 and switch to using proper indexed types internally.
Footnotes
-
In fact I am using the
cranelift-entity
crate directly, it's very well written. ↩
Thanks for the analysis and write up! Very similar to what I've seen when profiling cranelift.
I believe the best way to improve regalloc2 would be to use a design similar to Cranelift with a
Context
structure that can preserve memory allocation across multiple invocations of the register allocator.* Temporary `SmallVec`s and `HashSet`s should be replaced with a single instance in `Context`.
The way cranelift works is that each function is compiled on a separate thread effectively so there isn't a great way for cranelift at least to reuse these contexts since each function compilation only calls regalloc once.
I wonder how many of these temporary allocations could be reused within a single invocation of the register allocator though?
The many
SmallVec
s should be replaced withEntityList
fromcranelift-entity
and aListPool
inContext
.
There are some restrictions that EntityList
places on its elements: they have to be entity ids themselves. Not sure if that constraint will work for regalloc2's use cases or not. If not, we can look into lifting those restrictions or creating a new thing specifically for regalloc2's use cases.
The way cranelift works is that each function is compiled on a separate thread effectively so there isn't a great way for cranelift at least to reuse these contexts since each function compilation only calls regalloc once.
This is not mandatory. For example cg_clif compiles all functions in the same cgu on the same thread, reusing the Context
within the same cgu. For wasmtime you could use thread local storage I guess. Or maybe rayon has a better way?
For what's it's worth, I made a PR some time ago that allows to reuse cranelift's contexts across compilations of different functions on different threads: https://github.com/bytecodealliance/wasmtime/blob/main/crates/cranelift/src/compiler.rs#L52
So if the regalloc context was stored in the cranelift context, it looks like it could be reused.