`GC_collect` tries to realloc (oops)

Question

`GC_collect` tries to realloc (oops)

Closed this issue 2 years ago · 4 comments

Here is an interesting stacktrace I just got:

#0  []= () at /usr/share/crystal/src/pointer.cr:130
#1  push () at /usr/share/crystal/src/deque.cr:480
#2  0x000055555558a3af in lock_slow () at /usr/share/crystal/src/mutex.cr:70
#3  0x000055555558d882 in lock () at /usr/share/crystal/src/mutex.cr:51
#4  collect () at /usr/share/crystal/src/mutex.cr:132
#5  0x0000555555591e5c in GlobalAllocator_tryCollect (self=0x5555557a2290) at src/global_allocator.c:151
#6  GC_GlobalAllocator_nextBlock (self=0x5555557a2290) at src/global_allocator.c:175
#7  0x0000555555592a3d in LocalAllocator_initCursor (self=0x5555557a2530) at src/local_allocator.c:10
#8  GC_LocalAllocator_allocateSmall (self=0x5555557a2530, size=32, atomic=0) at src/local_allocator.c:135
#9  0x000055555556328c in malloc () at /home/github/immix.cr/src/immix.cr:23
#10 __crystal_malloc64 () at /usr/share/crystal/src/gc.cr:24
#11 malloc () at /usr/share/crystal/src/primitives.cr:164
#12 increase_capacity () at /usr/share/crystal/src/deque.cr:587
#13 push () at /usr/share/crystal/src/deque.cr:477
#14 0x000055555558a3af in lock_slow () at /usr/share/crystal/src/mutex.cr:70
#15 0x000055555558d882 in lock () at /usr/share/crystal/src/mutex.cr:51
#16 collect () at /usr/share/crystal/src/mutex.cr:132
#17 0x0000555555591e5c in GlobalAllocator_tryCollect (self=0x5555557a2290) at src/global_allocator.c:151
#18 GC_GlobalAllocator_nextBlock (self=0x5555557a2290) at src/global_allocator.c:175
#19 0x0000555555592a3d in LocalAllocator_initCursor (self=0x5555557a2530) at src/local_allocator.c:10
#20 GC_LocalAllocator_allocateSmall (self=0x5555557a2530, size=24, atomic=0) at src/local_allocator.c:135
#21 0x000055555556f4b2 in malloc () at /home/github/immix.cr/src/immix.cr:23
#22 __crystal_malloc64 () at /usr/share/crystal/src/gc.cr:24
#23 new () at /usr/share/crystal/src/deque.cr:25
#24 wait_readable () at /usr/share/crystal/src/io/evented.cr:125
#25 wait_readable () at /usr/share/crystal/src/io/evented.cr:119
#26 unbuffered_read () at /usr/share/crystal/src/io/evented.cr:59
#27 0x000055555556f349 in fill_buffer () at /usr/share/crystal/src/io/buffered.cr:261
#28 read () at /usr/share/crystal/src/io/buffered.cr:82
#29 0x00005555555700e0 in read_fully? () at /usr/share/crystal/src/io.cr:537
#30 read_fully () at /usr/share/crystal/src/io.cr:520
#31 0x0000555555587aa5 in decode () at /usr/share/crystal/src/io/byte_format.cr:123
#32 from_io () at /usr/share/crystal/src/int.cr:759
#33 read_bytes () at /usr/share/crystal/src/io.cr:913
#34 -> () at /usr/share/crystal/src/signal.cr:244
#35 0x00005555555865e9 in run () at /usr/share/crystal/src/fiber.cr:146
#36 0x0000000000000000 in ?? ()

We can see that we try to allocate (#23: Deque.new) but the HEAP has been filled, so the GC starts a collection, and calls GC_collect to collect all the stacks to walk, but... the Deque is too small and we try to increase it's capacity (#12), which tries to allocate but fails because the HEAP is filled, and so we try to collect (oops).

Answer 1 · 2022-06-10T08:11:23.000Z

AFAIK: the problem is relying on the Crystal stdlib (Mutex) when we should rely on libc instead (pthread_mutex). The former relies on the GC when the latter won't.

Answer 2 · 2022-06-10T08:25:04.000Z

~~Using Thread::Mutex may be fine as a quick workaround, but it won't do in the long run, because it raises on errors which will allocate an exception class.~~ It doesn't even work, and crashes with an infinite recursion.

Answer 3 · 2022-06-10T08:41:42.000Z

I tried to call GC_lock and GC_unlock but I get a reentrant/concurrent call into GC_collect:

GC.collect: GC_lock (fiber=main)
GC.collect: GC_lock (fiber=Signal Loop)
GC.collect: GC_unlock (fiber=Signal Loop)
<hangs forever>

It then hangs forever because the second GC.collect overwrote @@pending so the main fiber won't be resumed ever again.

Answer 4 · 2022-06-10T09:05:03.000Z

It looks like an Atomic::Flag can workaround the issue.