DuckLogic/zerogc

Simple: Thread local caches to avoid atomic overhead of `Chunk` in small object arenas

Techcable opened this issue · 3 comments

Right now allocation requires atomic operations. We should use a thread-local buffer so this isn't required in the common case.
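To make the cost concrete, here is a rough sketch of what allocating from a shared chunk with atomics tends to look like. This is not zerogc's actual code: `SharedChunk`, its fields, and `alloc` are hypothetical names, and the chunk only tracks offsets rather than real memory.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical shared chunk: it only tracks a bump offset, not real memory.
pub struct SharedChunk {
    next: AtomicUsize,
    limit: usize,
}

impl SharedChunk {
    pub fn new(limit: usize) -> Self {
        SharedChunk { next: AtomicUsize::new(0), limit }
    }

    /// Reserve `size` bytes and return the starting offset.
    /// Every call pays for at least one atomic read-modify-write,
    /// and retries if another thread won the race.
    pub fn alloc(&self, size: usize) -> Option<usize> {
        let mut current = self.next.load(Ordering::Relaxed);
        loop {
            let end = current.checked_add(size)?;
            if end > self.limit {
                return None; // chunk exhausted
            }
            match self.next.compare_exchange_weak(
                current,
                end,
                Ordering::AcqRel,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(current),
                Err(actual) => current = actual, // lost the race, retry
            }
        }
    }
}
```

A thread-local buffer would take this loop off the hot path: each thread would only touch the shared state when its private buffer runs out.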

This would be somewhat difficult to do since we can have multiple running instances. Would we have to use thread_local? How does the performance overhead of that compare to using atomics?

Maybe we could make `SmallObjectCache` a static variable shared between instances. However, if we do that we'd have to differentiate between allocations from different collectors (in `for_each` and the linked list). This could also result in worse performance if all the different collectors end up messing with each other's stuff.

This was much easier when it was just a comment in a config file.......

Maybe we could just make the cache the cold path. I think that would happen naturally if we implement Lazy Sweeping (#7). Is it acceptable if we still use atomics when allocating from the free list? We already use a loop there.....

Maybe do TLAB allocation? Some of Dora's GCs and some JVM GCs use it. Dora allocates 32KB of memory for a TLAB, and if an object fits into the TLAB (<8KB) the runtime allocates it in TLAB memory. This memory is not recycled, but it can be traced just fine.
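A minimal sketch of the TLAB idea described above, assuming the 32KB buffer and <8KB object limit from this comment. This is neither Dora's nor zerogc's implementation: `Tlab`, `alloc_small`, and the constants are hypothetical names. To stay in safe Rust it returns offsets into the buffer instead of pointers, so alignment is relative to the start of the buffer.

```rust
use std::cell::RefCell;

// Sizes taken from the comment above; the names are made up for this sketch.
const TLAB_SIZE: usize = 32 * 1024;
const MAX_TLAB_OBJECT: usize = 8 * 1024;

/// A thread-local allocation buffer: a private chunk plus a bump offset.
pub struct Tlab {
    buf: Box<[u8]>,
    offset: usize,
}

impl Tlab {
    pub fn new() -> Self {
        Tlab { buf: vec![0u8; TLAB_SIZE].into_boxed_slice(), offset: 0 }
    }

    /// Bump-allocate `size` bytes aligned to `align` (a power of two).
    /// Returns the offset within the buffer, or `None` if the object is
    /// too large for a TLAB or the buffer is full. No atomics involved.
    pub fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        if size >= MAX_TLAB_OBJECT {
            return None; // caller should use the regular (shared) path
        }
        let start = (self.offset + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None; // buffer full; a real GC would grab a fresh TLAB
        }
        self.offset = end;
        Some(start)
    }
}

thread_local! {
    // One buffer per thread, so the fast path never contends.
    static TLAB: RefCell<Tlab> = RefCell::new(Tlab::new());
}

/// Allocate from the current thread's buffer.
pub fn alloc_small(size: usize, align: usize) -> Option<usize> {
    TLAB.with(|t| t.borrow_mut().alloc(size, align))
}
```

With multiple collector instances, a single `thread_local!` like this would be shared by all of them on a given thread, which is the same problem raised earlier about a static `SmallObjectCache`; the sketch does not try to solve that.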

TLAB is definitely an option I'm considering. I'm planning to look into this more deeply after I get multithreading support working.