RTLD_GLOBAL Loads Don't Reliably work in VECs

Question

Opened this issue 4 years ago · 6 comments

insertinterestingnamehere commented 4 years ago

Continuation of #9. MKL is fine now due to changes in how it's normally loaded, but numba is still broken, and I'm not sure what's going on with OpenMP generally. Numba fails when, through various layers of code, this call is executed: https://github.com/llvm/llvm-project/blob/d480f968ad8b56d3ee4a6b6df5532d485b0ad01e/llvm/lib/Support/Unix/DynamicLibrary.inc#L28. I'm not clear on why yet. Weirdly enough, IIRC, MKL was failing with a symbol not found error with the logic used at https://github.com/IntelPython/mkl-service/blob/master/mkl/__init__.py#L38, not a crash.

Answer 1 · 2021-04-27T17:12:51.000Z

Okay, at least in the case of numba, the issue appears to actually be bad handling of RTLD_LAZY. (Use-case is here: https://github.com/numba/numba/blob/145e4435c9bea78071634db6084d8329129748ab/numba/__init__.py#L152). Hopefully this is just an issue with our dlopen forwarding, but it may be a problem with the underlying glibc patch.

Answer 2 · 2021-04-27T17:17:02.000Z

Not sure what was wrong with the old MKL loading logic though. That was with RTLD_NOW set.

Answer 3 · 2021-04-27T17:44:48.000Z

My dlopen wrapper forced some flags to specific values in the past. You might want to check if that's still true and if it's correct to do so.

Answer 4 · 2021-04-27T17:48:43.000Z

Maybe that's it. A quick reading earlier didn't show anything like that, but the log is showing RTLD_NOLOAD instead of RTLD_GLOBAL, so something's off there.
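For reference, the difference between the two flag combinations can be probed directly from Python. The sketch below calls dlopen through ctypes so the flags are passed exactly as given. The helper name raw_dlopen and the library name are made up for illustration, and the code assumes a Unix-like system where dlopen and dlerror can be resolved from the process's global symbol namespace. It does not involve Parla's virtualized dlopen or multiple linker namespaces. Per the dlopen man page, RTLD_NOLOAD only asks whether the object is already resident, so a NULL return with that flag means "not loaded here" rather than "file missing".

```python
import ctypes
import os

# Assumption: dlopen/dlerror are visible in the process's global namespace
# (typical for CPython on Linux).
_self = ctypes.CDLL(None)
_dlopen = _self.dlopen
_dlopen.restype = ctypes.c_void_p
_dlopen.argtypes = [ctypes.c_char_p, ctypes.c_int]
_dlerror = _self.dlerror
_dlerror.restype = ctypes.c_char_p
_dlerror.argtypes = []


def raw_dlopen(name, flags):
    """Call dlopen(name, flags) directly; return (handle, error_message)."""
    handle = _dlopen(name.encode(), flags)
    if handle:
        return handle, None
    err = _dlerror()
    # With RTLD_NOLOAD the linker may report no error text at all.
    message = err.decode(errors="replace") if err else "no error reported"
    return None, message


# Hypothetical library name, chosen so that it is certainly not resident.
name = "libdefinitely_not_loaded_example.so"

# Probe only: never loads anything, fails if the library is not resident.
print(raw_dlopen(name, os.RTLD_NOLOAD | os.RTLD_LAZY))

# Real load attempt: fails here only because the file does not exist.
print(raw_dlopen(name, os.RTLD_GLOBAL | os.RTLD_LAZY))
```

Running the same probe against a library that is known to be loaded would be one way to check whether a given namespace already has it.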

Answer 5 · 2021-04-27T17:50:59.000Z

Yah, there's nothing obviously wrong in https://github.com/ut-parla/Parla.py/blob/master/runtime_libs/virt_dlopen.c#L30. No idea where the RTLD_NOLOAD is coming from, but it's before that line.

Answer 6 · 2021-04-27T20:23:38.000Z

Okay, there's a lot going on here and I'm getting a bit disoriented. Here's what I've managed to confirm:

- Numba's try/except doesn't normally find the library in my current environment (it's libsvml.so).
- The load into linker namespace 0 works fine.
- The load into linker namespace 1 is what fails.
- For some reason the load is happening with RTLD_NOLOAD | RTLD_LAZY instead of RTLD_GLOBAL | RTLD_LAZY.
- The dlopen call is still getting routed through https://github.com/llvm/llvm-project/blob/d480f968ad8b56d3ee4a6b6df5532d485b0ad01e/llvm/lib/Support/Unix/DynamicLibrary.inc#L28 like I thought.
- There's no obvious patch in the conda-forge recipe to show why that flag change would be happening.
- There's also not currently any messing with the dlopen flags in our virtualized dlopen.
- The call to load libsvml.so only happens once, in namespace 0. There is no corresponding call in namespace 1 that causes the error. I haven't been able to figure out what is triggering the failed attempted load of libsvml in namespace 1. This is probably why the stack of exception handling code isn't kicking in though. The load may be triggered by something else entirely.
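When checking logged flag values like the ones above, it can help to decode the integer back into RTLD_* names instead of reading the bits by hand. A minimal sketch follows; the function name decode_dlopen_flags is hypothetical, and the flag values are taken from the running platform's os module rather than hard-coded, since they differ between systems.

```python
import os

# Flag names to look for; values come from the running platform's os module.
_FLAG_NAMES = (
    "RTLD_LAZY",
    "RTLD_NOW",
    "RTLD_GLOBAL",
    "RTLD_NOLOAD",
    "RTLD_NODELETE",
    "RTLD_DEEPBIND",
)


def decode_dlopen_flags(flags):
    """Return the set of RTLD_* names whose bits are all set in `flags`."""
    names = set()
    for name in _FLAG_NAMES:
        value = getattr(os, name, 0)  # 0 if this platform lacks the flag
        if value and (flags & value) == value:
            names.add(name)
    return names


# The combination seen in the log versus the one that was expected.
print(sorted(decode_dlopen_flags(os.RTLD_NOLOAD | os.RTLD_LAZY)))
print(sorted(decode_dlopen_flags(os.RTLD_GLOBAL | os.RTLD_LAZY)))
```

RTLD_LOCAL is left out on purpose: on some platforms its value is 0, so it cannot be detected by testing bits.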

All that said, as a short-term workaround, numba can be successfully loaded with this incantation:

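import os
from parla.multiload import multiload, mark_module_as_global

# Must be set before numba is imported so that it skips the libsvml load.
os.environ['NUMBA_DISABLE_INTEL_SVML'] = '1'
# Works around a separate import-override bug (see the note below).
mark_module_as_global('pkg_resources')
with multiload():
    import numba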

Given that libsvml doesn't appear to even be present on my system, loading numba this way doesn't really lose us anything other than the import not working out of the box as expected.
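A quick way to check whether libsvml is visible to the system linker at all is ctypes.util.find_library. This is only a rough sketch: find_library searches the standard system locations, so it may well miss a copy that lives inside a conda environment's lib directory, and a None result should be read as "not found in the usual places" rather than proof of absence.

```python
import ctypes.util


def svml_library_name():
    """Return the name find_library reports for libsvml, or None if not found."""
    return ctypes.util.find_library("svml")


found = svml_library_name()
if found is None:
    print("libsvml not found on the standard library search path")
else:
    print("libsvml found as", found)
```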

Note: the pkg_resources business is needed to bypass another bug in the import override. I don't know the exact details, but it's likely because they mess with importlib in that module, so we're running up against #10.