MasonProtter/Bumper.jl

Add/use slab-bump allocator?

chriselrod opened this issue · 5 comments

The basic idea is, you have slabs of some size.
When you run out of memory, you allocate a new slab.

Examples:
llvm: https://llvm.org/doxygen/Allocator_8h_source.html
LoopModels: https://github.com/JuliaSIMD/LoopModels/blob/bumprealloc/include/Utilities/Allocators.hpp

LoopModels' is largely a copy of LLVM's, but supports either bumping up or bumping down.
LoopModels' slab size is constant, while LLVM's slabs grow.
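For concreteness, LLVM's BumpPtrAllocator scales the slab size up as more slabs are allocated, so early slabs stay small but heavy use doesn't drown in per-slab overhead. A one-line sketch of such a growth schedule (the doubling interval of 128 mirrors LLVM's default, but treat the constant as an assumption for illustration):

```julia
# Hypothetical growth schedule: double the slab size every 128 slabs,
# in the style of LLVM's BumpPtrAllocator.
slab_size(base, nslabs) = base * (1 << (nslabs ÷ 128))
```

With base = 4096, slabs 0..127 are 4 KiB, slabs 128..255 are 8 KiB, and so on.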

A julia struct itself could look like

mutable struct BumpAlloc{Up,SlabSize}
    current::Ptr{Cvoid}
    slabend::Ptr{Cvoid}
    # you could try to get fancy and reduce the number of indirections by having your own array type
    slabs::Vector{Ptr{Cvoid}}
    custom_slabs::Vector{Ptr{Cvoid}}
end
# should probably register a finalizer that `Libc.free`s all the pointers
# optionally use a faster library like `mimalloc` instead of `Libc`

The custom_slabs are for objects too big for the SlabSize.
The point of keeping them separate was largely that in C++ there may exist faster free/delete functions that take the size as an argument (i.e. they might exist, and they might be faster).
Given that we don't have that here, we may as well fuse them, unless you find some allocator API that supports sizes.
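To make the idea concrete, here is a minimal runnable sketch of the bump-up fast path described above. The names (Bump, bumpalloc!, addslab!, freeall!) and the fixed slab size are made up for illustration; a real implementation would add the Up/down type parameter, the custom-slab list, and a finalizer as discussed.

```julia
const SLABSIZE = 16384  # illustrative; real implementations tune or grow this

mutable struct Bump
    current::Ptr{Cvoid}
    slabend::Ptr{Cvoid}
    slabs::Vector{Ptr{Cvoid}}
end

function addslab!(b::Bump, sz::Int = SLABSIZE)
    p = Libc.malloc(sz)          # swap in mimalloc etc. here if desired
    push!(b.slabs, p)
    b.current = p
    b.slabend = p + sz
    return b
end

Bump() = addslab!(Bump(C_NULL, C_NULL, Ptr{Cvoid}[]))

# Round a pointer up to the next multiple of `a` (a power of two).
align(p::Ptr{Cvoid}, a::Int) = Ptr{Cvoid}((UInt(p) + (a - 1)) & ~UInt(a - 1))

function bumpalloc!(b::Bump, sz::Int, alignment::Int = 8)
    p = align(b.current, alignment)
    if p + sz > b.slabend
        # Doesn't fit: start a fresh slab, oversized if the request is big.
        # (malloc's own alignment covers `alignment` up to 16 here.)
        addslab!(b, max(SLABSIZE, sz))
        p = b.current
    end
    b.current = p + sz
    return p
end

# Reset everything at once; this is what the finalizer would do.
freeall!(b::Bump) = (foreach(Libc.free, b.slabs); empty!(b.slabs);
                     b.current = b.slabend = C_NULL; b)
```

Allocation on the fast path is just an add, a compare, and a store, which is the whole appeal.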

Being able to grow lets you default to a much smaller slab size.

I was thinking about modifying SimpleChains to use something like this.

Yeah that'd be great to have here. It definitely should be the default.

I guess overcommitting with mmap may be an even better idea?

julia> Sys.total_memory() / 1<<30
15.319782257080078

julia> Sys.free_memory() / 1<<30
10.188095092773438

julia> using Mmap

julia> x = Mmap.mmap(Vector{UInt8}, Sys.total_physical_memory() + 1<<30); # mmap 1 more gig than total memory

julia> Sys.free_memory() / 1<<30 # not actually using the memory yet
10.179462432861328

I think this might not work on Windows, but there you could pick a little bit less than the total memory. You obviously won't be using more than the total memory anyway, so that isn't a real loss.

Basically, the OS gives you as much as you ask for, without actually mapping any particular page to physical memory until you touch it.
So this is better than the slab approach in some sense.

You'll page fault when you hit a new page (but your allocator for the slab has that problem too!), and then the OS gives it to you.

This makes partial resets easier/more efficient.
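A sketch of this reserve-everything-up-front variant: mmap one large anonymous region and bump an index into it, with a checkpoint/reset pair for the partial resets mentioned above. The names (MmapBump, alloc!, checkpoint, reset!) are hypothetical, and the reservation size is chosen by the caller (e.g. a bit under Sys.total_memory()).

```julia
using Mmap

mutable struct MmapBump
    buf::Vector{UInt8}   # reserved up front; pages commit lazily on first touch
    offset::Int          # next free byte (1-based)
end

# Reserve `reserve` bytes; the OS won't back them with physical pages
# until they are written.
MmapBump(reserve::Integer) = MmapBump(Mmap.mmap(Vector{UInt8}, Int(reserve)), 1)

function alloc!(b::MmapBump, sz::Int)
    start = b.offset
    start + sz - 1 <= length(b.buf) || error("reservation exhausted")
    b.offset += sz
    return view(b.buf, start:start+sz-1)
end

# Partial reset: save a mark, then roll the offset back to it later.
checkpoint(b::MmapBump) = b.offset
reset!(b::MmapBump, mark::Int = 1) = (b.offset = mark; b)
```

Since nothing ever moves, a reset is just a single integer store, and the already-committed pages get reused for free on the next pass.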

Hm, that's a really interesting approach. Is this what our GC is doing under the hood too?

I think it's the primary way programs get memory from the OS, e.g. mallocs will mmap memory under the hood.
I haven't looked into how Julia's GC is implemented, but would guess so.
Searching the source code you can find plenty of mmap (aka VirtualAlloc on Windows), e.g. https://github.com/JuliaLang/julia/blob/406ba123cedcfcc66cb50d15bb5eaeb2bcefea5b/src/cgmemmgr.cpp#L38

Also, Vectors in Julia stop copying memory/relocating over a certain size; the OS will just start paging in more memory as needed. I also haven't looked into how that is implemented, but would definitely guess it is using mmap.

Here's a blogpost where someone talks about it in more detail:
https://yakubin.com/notes/comp/reserve-and-commit.html
Julia already takes care of the messiest bits in the Mmap standard library (i.e. it is cross-platform), and gives you access to total_memory so you can avoid overcommitting (which can be disabled on Linux, and doesn't work on Windows). It's nice to know what the total memory is so that you can safely undershoot it.

A bit off topic (though still on the topic of bump allocators): I liked this blog post describing arenas and the author's memory-management strategy in C (which is more or less what I've also started doing recently):
https://www.rfleury.com/p/untangling-lifetimes-the-arena-allocator
For certain data structures, it's easy to reason about and very fast.

I think partial deallocations are important, so that each arena/bump allocator can work like a stack where you push things onto it and pop them off.
Although simple use cases like SimpleChains don't really need that functionality.
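The stack discipline is easy to sketch: save the arena's position, hand out scratch allocations, then pop back to the saved mark so everything allocated in between is reclaimed at once. Everything below (Arena, push_alloc!, temp) is a hypothetical minimal version for illustration, in the spirit of the "scratch arena" pattern from the linked post.

```julia
mutable struct Arena
    buf::Vector{UInt8}
    top::Int
end
Arena(n::Integer) = Arena(Vector{UInt8}(undef, n), 0)

function push_alloc!(a::Arena, sz::Int)
    a.top + sz <= length(a.buf) || error("arena exhausted")
    r = view(a.buf, a.top+1:a.top+sz)
    a.top += sz
    return r
end

# Run `f` with scratch space, then pop everything `f` allocated.
function temp(f, a::Arena)
    saved = a.top
    try
        return f(a)
    finally
        a.top = saved   # one store reclaims all scratch allocations
    end
end
```

Usage looks like a do-block:

```julia
a = Arena(1024)
keep = push_alloc!(a, 16)          # survives the scratch region below
total = temp(a) do ar
    scratch = push_alloc!(ar, 512) # popped automatically on exit
    scratch .= 0xff
    sum(Int, scratch)
end
```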

@chriselrod turns out that you don't even need mmap for this. A regular undef array will do.

julia> Sys.free_memory() / 1e9
12.138631168

julia> x = Vector{UInt8}(undef, Sys.total_physical_memory());

julia> Sys.free_memory() / 1e9
12.127961088

julia> x[1:end÷4] .= 1;

julia> Sys.free_memory() / 1e9
7.952367616