bytecodealliance/regalloc2

regalloc2 makes too many memory allocations

Amanieu opened this issue · 3 comments

I used heaptrack to analyze the memory allocation patterns of my compiler which uses regalloc2. I used a large benchmark (LLVM library) which has 1257955 basic blocks in 104913 functions. Out of a total of 46780667 allocations, 46779011 of them come from regalloc2 and only 1656 come from other parts of the compiler.

My compiler uses a similar structure 1 to Cranelift where a Context structure allows memory allocations to be reused across compilation units. Specifically, this allows various Vecs and HashMaps to grow to a sufficient size only once while being reused many times for compiling multiple functions.

regalloc2 unfortunately doesn't follow this design:

  • A fresh Env structure is allocated and freed for each function. No memory allocations are preserved.
  • Heavy use of temporary SmallVecs and HashSets. Heaptrack reports that 19865290 allocations (42% of the total) are "temporary". These are allocations which are immediately followed by a free, with no other allocations in between.
  • Heavy use of BTreeMap which needs to allocate many nodes separately and cannot reuse memory unlike HashMap and Vec.

Heaptrack results

Here are the major allocation hotspots shown by heaptrack:

In try_to_allocate_bundle_to_reg:

  • 16158324 temporary allocations from the FxHashSet.
  • 9677435 allocations from inserting liveranges into PRegData::allocations (LiveRangeSet backed by a BTreeMap).

In merge_bundles:

  • 4478881 allocations from pushing a VReg to SpillSet::vregs (SmallVec).
  • 1895443 allocations from appending LiveRangeListEntrys to LiveBundle::ranges (SmallVec).

In insert_use_into_liverange:

  • 2578957 allocations from pushing a Use to LiveRange::uses (SmallVec).

In add_liverange_to_vreg:

  • 1432853 allocations from pushing a LiveRangeListEntry to VRegData::ranges (SmallVec).

In create_pregs_and_vregs:

  • 948582 allocations from pushing to Env::vregs and Env::allocs and Env::inst_alloc_offsets. (Vec)

In Env::new:

  • 1154043 allocations from initializing the various Vecs with Vec::with_capacity.

Performance impact

Benchmarking shows that approximately 10% of the time is spent in memory allocation functions (malloc, realloc, free). Additionally, another 10% of the time is spent in memcpy/memmove which could be due to vector reallocation.

Compiling with multiple threads can cause additional contention on the global allocator. This was noticed on musl targets where the global allocator doesn't have thread-local caches: compiler performance on 6 threads (on a 6-core machine) was reduced to that of a single thread whereas other allocators allow almost-perfect performance scaling.

Proposed solution

I believe the best way to improve regalloc2 would be to use a design similar to Cranelift with a Context structure that can preserve memory allocation across multiple invocations of the register allocator.

  • Temporary SmallVecs and HashSets should be replaced with a single instance in Context.
  • The many BTreeMaps should be replaced with cranelift-bforest which maintains a pool of nodes that can be reused.
  • The many SmallVecs should be replaced with EntityList from cranelift-entity and a ListPool in Context.

Since this would require a dependency on the cranelift-entity and cranelift-bforest crates, this is also a good opportunity to resolve #62 and switch to using proper indexed types internally.

Footnotes

  1. In fact I am using the cranelift-entity crate directly, it's very well written.

Thanks for the analysis and write up! Very similar to what I've seen when profiling cranelift.

I believe the best way to improve regalloc2 would be to use a design similar to Cranelift with a Context structure that can preserve memory allocation across multiple invocations of the register allocator.

* Temporary `SmallVec`s and `HashSet`s should be replaced with a single instance in `Context`.

The way cranelift works is that each function is compiled on a separate thread effectively so there isn't a great way for cranelift at least to reuse these contexts since each function compilation only calls regalloc once.

I wonder how many of these temporary allocations could be reused within a single invocation of the register allocator though?

The many SmallVecs should be replaced with EntityList from cranelift-entity and a ListPool in Context.

There are some restrictions that EntityList places on its elements: they have to be entity ids themselves. Not sure if that constraint will work for regalloc2's use cases or not. If not, we can look into lifting those restrictions or creating a new thing specifically for regalloc2's use cases.

The way cranelift works is that each function is compiled on a separate thread effectively so there isn't a great way for cranelift at least to reuse these contexts since each function compilation only calls regalloc once.

This is not mandatory. For example cg_clif compiles all functions in the same cgu on the same thread, reusing the Context within the same cgu. For wasmtime you could use thread local storage I guess. Or maybe rayon has a better way?

For what's it's worth, I made a PR some time ago that allows to reuse cranelift's contexts across compilations of different functions on different threads: https://github.com/bytecodealliance/wasmtime/blob/main/crates/cranelift/src/compiler.rs#L52

So if the regalloc context was stored in the cranelift context, it looks like it could be reused.