scylladb/seastar

Dependent library init can cause memory (re)allocation failures when tests are run without `SEASTAR_DEFAULT_ALLOCATOR`

elcallio opened this issue · 2 comments

When `SEASTAR_DEFAULT_ALLOCATOR` is not active, we override malloc, free, realloc etc.

If an application, say a database, loads dependent libraries that allocate memory during dlinit-time initialization, and later in the lifetime of the process does something that causes said memory to be freed, realloc'ed or whatnot,
we can run into angry asserts that we are on the wrong CPU, unless we make sure to confine all usage of said external libraries to shard 0.

So far so good.
But when running unit tests, dlinit happens in the test runner thread, not in the actual reactor thread 0, even though reactor init is delayed into the test runner thread (effectively thread 1).

The problem is that cpu_mem, which contains all the info, is a thread-local structure. Fair enough. But the cpu_id in it is assigned from an atomic counter, incremented on each init. This means that we end up with:

Test runner: cpu_mem{ cpu_id = 0 }
Reactor-0: cpu_mem{ cpu_id = 1 }
Reactor-1: cpu_mem{ cpu_id = 2 }
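
To illustrate the id assignment described above, here is a minimal sketch. It is not Seastar's actual code: `cpu_mem_sketch`, `my_cpu_id` and `ids_seen_by_reactors` are hypothetical names, and the only thing modeled is "thread-local struct takes its id from a global atomic counter on first use".

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-thread allocator state (not Seastar's definition).
struct cpu_mem_sketch {
    unsigned cpu_id;
    // Each init takes the next value of a process-wide atomic counter.
    cpu_mem_sketch() : cpu_id(next_id().fetch_add(1)) {}
    static std::atomic<unsigned>& next_id() {
        static std::atomic<unsigned> counter{0};
        return counter;
    }
};

// The first call on a given thread constructs that thread's instance.
static unsigned my_cpu_id() {
    thread_local cpu_mem_sketch mem;
    return mem.cpu_id;
}

// Simulates the test runner: the calling thread touches the allocator first
// (as a dlinit-time allocation would), then the "reactor" threads start.
static std::vector<unsigned> ids_seen_by_reactors(unsigned n) {
    my_cpu_id();  // the runner thread claims the first id
    std::vector<unsigned> ids(n);
    for (unsigned i = 0; i < n; ++i) {
        std::thread t([&ids, i] { ids[i] = my_cpu_id(); });
        t.join();  // started one at a time so the ids are deterministic
    }
    return ids;
}
```

Called from a fresh process, the runner thread ends up with id 0 and the two simulated reactors with 1 and 2, which is the shifted numbering listed above.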

etc. Again, this would not be a problem, except that the only way we have to ensure that an allocation from a third party that might combine/mess with data from a static init is handled on the same CPU is to run it on the originating thread, and we cannot in any meaningful way send this to the originating thread (which hogs cpu_id 0).

We do have a guard on malloc, checking whether the calling context is a reactor thread or not, and if not, we fall back to the original libc malloc instead. But the detection of such memory in a subsequent realloc is apparently not foolproof, because we sometimes get past it when the alloc was made before original_malloc_func is initialized.
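
A rough sketch of how such a guard can leave a hole. Everything here is hypothetical (`original_malloc_sketch`, `is_reactor_thread_sketch`, `choose_backend`); in particular it assumes the fallback is only taken when the saved libc pointer is already non-null, which is my reading of the symptom, not a copy of the real code.

```cpp
#include <cstddef>
#include <cstdlib>

enum class backend { libc, seastar_like };

using malloc_fn = void* (*)(std::size_t);

// Hypothetical mirror of original_malloc_func: null until its initializer has run.
static malloc_fn original_malloc_sketch = nullptr;

// Hypothetical per-thread flag: true only on threads that run a reactor.
static thread_local bool is_reactor_thread_sketch = false;

// Decide which allocator serves a malloc call made on the current thread.
static backend choose_backend() {
    // Assumed guard: non-reactor threads fall back to libc, but only if the
    // libc pointer has already been set up.
    if (!is_reactor_thread_sketch && original_malloc_sketch != nullptr) {
        return backend::libc;
    }
    // Otherwise the allocation lands in the per-thread allocator, owned by
    // whatever cpu_id the calling thread happens to have.
    return backend::seastar_like;
}
```

Under that assumption, an allocation made on the test runner thread before the pointer is set is served by the per-thread allocator, so a later realloc or free of it from a reactor thread sees memory owned by another cpu_id.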

Provoked via openssl usage, which exhibits this (not great) pattern for certain algorithm resolutions (see ossl_provider_set_operation_bit).
However, most dependencies (libc, boost etc.) allocate at dlinit time.

Of course, the whole shebang is due to the static init order fiasco for original_malloc_func et al.

Note that the fiasco only appears if we either run the boost test harness, or otherwise realloc/free external objects allocated at dlinit time.