ysbaddaden/gc

`GC_collect` tries to realloc (oops)

Closed this issue · 4 comments

Here is an interesting stacktrace I just got:

#0  []= () at /usr/share/crystal/src/pointer.cr:130
#1  push () at /usr/share/crystal/src/deque.cr:480
#2  0x000055555558a3af in lock_slow () at /usr/share/crystal/src/mutex.cr:70
#3  0x000055555558d882 in lock () at /usr/share/crystal/src/mutex.cr:51
#4  collect () at /usr/share/crystal/src/mutex.cr:132
#5  0x0000555555591e5c in GlobalAllocator_tryCollect (self=0x5555557a2290) at src/global_allocator.c:151
#6  GC_GlobalAllocator_nextBlock (self=0x5555557a2290) at src/global_allocator.c:175
#7  0x0000555555592a3d in LocalAllocator_initCursor (self=0x5555557a2530) at src/local_allocator.c:10
#8  GC_LocalAllocator_allocateSmall (self=0x5555557a2530, size=32, atomic=0) at src/local_allocator.c:135
#9  0x000055555556328c in malloc () at /home/github/immix.cr/src/immix.cr:23
#10 __crystal_malloc64 () at /usr/share/crystal/src/gc.cr:24
#11 malloc () at /usr/share/crystal/src/primitives.cr:164
#12 increase_capacity () at /usr/share/crystal/src/deque.cr:587
#13 push () at /usr/share/crystal/src/deque.cr:477
#14 0x000055555558a3af in lock_slow () at /usr/share/crystal/src/mutex.cr:70
#15 0x000055555558d882 in lock () at /usr/share/crystal/src/mutex.cr:51
#16 collect () at /usr/share/crystal/src/mutex.cr:132
#17 0x0000555555591e5c in GlobalAllocator_tryCollect (self=0x5555557a2290) at src/global_allocator.c:151
#18 GC_GlobalAllocator_nextBlock (self=0x5555557a2290) at src/global_allocator.c:175
#19 0x0000555555592a3d in LocalAllocator_initCursor (self=0x5555557a2530) at src/local_allocator.c:10
#20 GC_LocalAllocator_allocateSmall (self=0x5555557a2530, size=24, atomic=0) at src/local_allocator.c:135
#21 0x000055555556f4b2 in malloc () at /home/github/immix.cr/src/immix.cr:23
#22 __crystal_malloc64 () at /usr/share/crystal/src/gc.cr:24
#23 new () at /usr/share/crystal/src/deque.cr:25
#24 wait_readable () at /usr/share/crystal/src/io/evented.cr:125
#25 wait_readable () at /usr/share/crystal/src/io/evented.cr:119
#26 unbuffered_read () at /usr/share/crystal/src/io/evented.cr:59
#27 0x000055555556f349 in fill_buffer () at /usr/share/crystal/src/io/buffered.cr:261
#28 read () at /usr/share/crystal/src/io/buffered.cr:82
#29 0x00005555555700e0 in read_fully? () at /usr/share/crystal/src/io.cr:537
#30 read_fully () at /usr/share/crystal/src/io.cr:520
#31 0x0000555555587aa5 in decode () at /usr/share/crystal/src/io/byte_format.cr:123
#32 from_io () at /usr/share/crystal/src/int.cr:759
#33 read_bytes () at /usr/share/crystal/src/io.cr:913
#34 -> () at /usr/share/crystal/src/signal.cr:244
#35 0x00005555555865e9 in run () at /usr/share/crystal/src/fiber.cr:146
#36 0x0000000000000000 in ?? ()

We can see that we try to allocate (#23: Deque.new) but the HEAP has been filled, so the GC starts a collection, and calls GC_collect to collect all the stacks to walk, but... the Deque is too small and we try to increase it's capacity (#12), which tries to allocate but fails because the HEAP is filled, and so we try to collect (oops).

AFAIK: the problem is relying on the Crystal stdlib (Mutex) when we should rely on libc instead (pthread_mutex). The former relies on the GC when the latter won't.

Using Thread::Mutex may be fine as a quick workaround, but it won't do in the long run, because it raises on errors which will allocate an exception class. It doesn't even work, and crashes with an infinite recursion.

I tried to call GC_lock and GC_unlock but I get a reentrant/concurrent call into GC_collect:

GC.collect: GC_lock (fiber=main)
GC.collect: GC_lock (fiber=Signal Loop)
GC.collect: GC_unlock (fiber=Signal Loop)
<hangs forever>

It then hangs forever because the second GC.collect overwrote @@pending so the main fiber won't be resumed ever again.

It looks like an Atomic::Flag can workaround the issue.