numpy/numpy

BUG: Binary Builds Deadlock due to OpenBLAS threading issue with fork

Closed this issue · 15 comments

Describe the issue:

Consider the snippet shown below.

On my system, with NumPy installed via pip, this deadlocks; GDB shows the embedded OpenBLAS stuck on a mutex. Rerunning with export OMP_NUM_THREADS=1 resolves the issue. I suspect this is due to a recent change in how the binary distributions are prepared (maybe previously they were not using a multithreaded OpenBLAS?). The snippet itself should be fine, however, as it is the parent which continues to use NumPy, and its threads should not have changed. Maybe OpenBLAS has a bad atfork handler?

If this is intended behaviour (NumPy not being compatible with scripts that explicitly call fork), then it might be worth using os.register_at_fork to raise an exception instead of deadlocking silently.
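For illustration, a rough sketch of what such a guard could look like (the handler name and message below are hypothetical, not an existing NumPy API):

import os
import warnings

def _blas_fork_guard():
    # Hypothetical hook: runs in the parent just before fork(), so a script
    # that forks after the OpenBLAS thread pool has been spun up gets a
    # visible warning (or, if preferred, an exception) instead of a silent
    # deadlock later on.
    warnings.warn(
        "fork() called after NumPy/OpenBLAS worker threads were started; "
        "subsequent BLAS calls may deadlock",
        RuntimeWarning,
        stacklevel=2,
    )

os.register_at_fork(before=_blas_fork_guard)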

Reproduce the code example:

import numpy as np
import os

# Do some algebra
A = np.random.randn(216, 216)
np.linalg.inv(A)

# Fork but have the child do nothing
if (pid := os.fork()) != 0:
    # Deadlock!
    np.linalg.inv(A)

    # Wait for the child
    os.waitpid(pid, 0)

Error message:

Python and NumPy Versions:

Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /opt/_internal/cpython-3.14.0/lib/python3.14/site-packages/scipy_openblas64/include
    lib directory: /opt/_internal/cpython-3.14.0/lib/python3.14/site-packages/scipy_openblas64/lib
    name: scipy-openblas
    openblas configuration: OpenBLAS 0.3.30  USE64BITINT DYNAMIC_ARCH NO_AFFINITY
      Haswell MAX_THREADS=64
    pc file directory: /project/.openblas
    version: 0.3.30
  lapack:
    detection method: pkgconfig
    found: true
    include directory: /opt/_internal/cpython-3.14.0/lib/python3.14/site-packages/scipy_openblas64/include
    lib directory: /opt/_internal/cpython-3.14.0/lib/python3.14/site-packages/scipy_openblas64/lib
    name: scipy-openblas
    openblas configuration: OpenBLAS 0.3.30  USE64BITINT DYNAMIC_ARCH NO_AFFINITY
      Haswell MAX_THREADS=64
    pc file directory: /project/.openblas
    version: 0.3.30
Compilers:
  c:
    commands: cc
    linker: ld.bfd
    name: gcc
    version: 14.2.1
  c++:
    commands: c++
    linker: ld.bfd
    name: gcc
    version: 14.2.1
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.1.4
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: linux
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: linux
Python Information:
  path: /tmp/build-env-qywnc8nt/bin/python
  version: '3.14'
SIMD Extensions:
  baseline:
  - SSE
  - SSE2
  - SSE3
  found:
  - SSSE3
  - SSE41
  - POPCNT
  - SSE42
  - AVX
  - F16C
  - FMA3
  - AVX2
  not found:
  - AVX512F
  - AVX512CD
  - AVX512_KNL
  - AVX512_KNM
  - AVX512_SKX
  - AVX512_CLX
  - AVX512_CNL
  - AVX512_ICL
  - AVX512_SPR

Runtime Environment:

No response

Context for the issue:

No response

I can confirm the deadlock on CPython 3.14 and 3.13 (and maybe others). I used 3.13 so I could go back through versions to see where it started. numpy==2.3.1 does not deadlock, numpy==2.3.2 does. The corresponding OpenBLAS versions are 0.3.29 (does not deadlock) and 0.3.30 (does deadlock). Maybe connected to #29391.

Bisecting points to OpenMathLib/OpenBLAS#5170, which fixed some other threading problems. The backtrace at the hang is below. I am a bit confused why there are multiple calls to dgetrf_parallel in the stack. @martin-frbg any thoughts?

(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7ffff71efa80 <server_lock>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7ffff71efa80 <server_lock>, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007ffff7ca0101 in lll_mutex_lock_optimized (mutex=0x7ffff71efa80 <server_lock>) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x7ffff71efa80 <server_lock>) at ./nptl/pthread_mutex_lock.c:93
#4  0x00007ffff605b388 in blas_thread_init.part () from /tmp/venv313/lib/python3.13/site-packages/scipy_openblas64/lib/libscipy_openblas64_.so
#5  0x00007ffff605b93b in exec_blas_async () from /tmp/venv313/lib/python3.13/site-packages/scipy_openblas64/lib/libscipy_openblas64_.so
#6  0x00007ffff6060fb3 in dgetrf_parallel () from /tmp/venv313/lib/python3.13/site-packages/scipy_openblas64/lib/libscipy_openblas64_.so
#7  0x00007ffff60607bd in dgetrf_parallel () from /tmp/venv313/lib/python3.13/site-packages/scipy_openblas64/lib/libscipy_openblas64_.so
#8  0x00007ffff60607bd in dgetrf_parallel () from /tmp/venv313/lib/python3.13/site-packages/scipy_openblas64/lib/libscipy_openblas64_.so
#9  0x00007ffff60607bd in dgetrf_parallel () from /tmp/venv313/lib/python3.13/site-packages/scipy_openblas64/lib/libscipy_openblas64_.so
#10 0x00007ffff5cfb890 in scipy_dgesv_64_ () from /tmp/venv313/lib/python3.13/site-packages/scipy_openblas64/lib/libscipy_openblas64_.so
#11 0x00007ffff784b660 in call_gesv (params=0x7fffffff79f0) at ../numpy/linalg/umath_linalg.cpp:1674
#12 inv<double> (args=0x7fffb911a720, dimensions=<optimized out>, steps=<optimized out>, __NPY_UNUSED_TAGGEDfunc=<optimized out>)
    at ../numpy/linalg/umath_linalg.cpp:1857
#13 0x00007fffb9873b39 in generic_wrapped_legacy_loop (__NPY_UNUSED_TAGGEDcontext=<optimized out>, data=<optimized out>, dimensions=<optimized out>, 
    strides=<optimized out>, auxdata=0x7fffb915e070) at ../numpy/_core/src/umath/legacy_array_method.c:98


I believe they're using a recursive implementation. See:

https://github.com/OpenMathLib/OpenBLAS/blob/develop/lapack/getrf/getrf_parallel.c#L678

which calls:

https://github.com/OpenMathLib/OpenBLAS/blob/develop/lapack/getrf/getrf_parallel.c#L764
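A recursive LU typically factors the leading block, updates the off-diagonal panels, and then calls itself on the Schur complement, which would explain several nested frames of the same function. A toy Python sketch of that shape (no pivoting, nothing like OpenBLAS's actual threaded code, just to illustrate the nesting):

import numpy as np

def recursive_getrf(a, threshold=64):
    # Toy recursive blocked LU without pivoting; each level factors the
    # leading block, updates the off-diagonal panels, and recurses on the
    # Schur complement, which is why the same frame stacks up.
    n = a.shape[0]
    if n <= threshold:
        for k in range(n - 1):                          # unblocked base case
            a[k + 1:, k] /= a[k, k]
            a[k + 1:, k + 1:] -= np.outer(a[k + 1:, k], a[k, k + 1:])
        return
    m = n // 2
    recursive_getrf(a[:m, :m], threshold)               # nested call #1
    l11 = np.tril(a[:m, :m], -1) + np.eye(m)
    u11 = np.triu(a[:m, :m])
    a[:m, m:] = np.linalg.solve(l11, a[:m, m:])         # U12
    a[m:, :m] = np.linalg.solve(u11.T, a[m:, :m].T).T   # L21
    a[m:, m:] -= a[m:, :m] @ a[:m, m:]                  # Schur complement
    recursive_getrf(a[m:, m:], threshold)               # nested call #2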

I have always found the threading support in OpenBLAS to be somewhat twitchy. Likely not something I'd trust without a huge test suite.

ilayn commented

Did anyone have the capacity to try this on MKL or Accelerate platforms to single out OpenBLAS?

ev-br commented

Seems to run fine with numpy main (5fe514b), python 3.12, and

$ mamba list |grep mkl
# packages in environment at /home/ev-br/.conda/envs/numpy-dev-mkl:
mkl                       2025.3.0           h0e700b2_462    conda-forge
mkl-devel                 2025.3.0           ha770c72_462    conda-forge
mkl-include               2025.3.0           hf2ce2f3_462    conda-forge
$ cat fork.py 
import numpy as np
import os

# Do some algebra
A = np.random.randn(216, 216)
np.linalg.inv(A)

# Fork but have the child do nothing
if (pid := os.fork()) != 0:
    # Deadlock!
    np.linalg.inv(A)

    # Wait for the child
    os.waitpid(pid, 0)

$ time for i in `seq 1 100`; do python -P fork.py; done

real	0m11.428s
user	0m7.336s
sys	0m3.927s

On the same machine, does hang with

$ mamba list |grep openblas
libblas                   3.9.0           34_h59b9bed_openblas    conda-forge
libcblas                  3.9.0           34_he106b2a_openblas    conda-forge
liblapack                 3.9.0           34_h7ac8fdf_openblas    conda-forge
libopenblas               0.3.30          pthreads_h94d23a6_1    conda-forge
openblas                  0.3.30          pthreads_h6ec200e_1    conda-forge
$ mamba list |grep numpy
numpy                     2.2.6           py311h5d046bc_0    conda-forge

Having a reproducer for Linux/x86_64 (as opposed to a hang happening exclusively on OSX/x86_64 as with the related scipy issue) should help, thanks.
And regarding the perceived twitchiness: the problem is (largely) that thread safety was a very late consideration. With the original GotoBLAS and any subsequent OpenBLAS until the beginning of my involvement about 10 years ago, one had to use OpenMP and hope for the best, while the pthreads build was racy as hell.

What is surprising is that it is the parent which has issues. When forking, one would expect the child to get confused (since all of the other threads it thinks it spawned in the pool are now gone), but the parent should usually be fine even if it doesn't take any action. This makes me wonder if an atfork handler is at fault.

Also, if it makes any difference, the code seems to work with lower thread counts, i.e., export OMP_NUM_THREADS=2 does not exhibit the issue, but larger numbers do (and I was running on a 20-"core" machine).
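As an aside, the single-thread workaround can also be applied from inside the script rather than via the environment. This is only a sketch, assuming threadpoolctl is installed and that a runtime limit behaves like OMP_NUM_THREADS=1 here:

import os
import numpy as np
from threadpoolctl import threadpool_limits

A = np.random.randn(216, 216)
np.linalg.inv(A)

# Keep the BLAS pool at a single thread around the fork; with no worker
# threads in play, the post-fork inversion in the parent is not expected to
# hit the blas_thread_init deadlock (an assumption based on the
# OMP_NUM_THREADS=1 observation above).
with threadpool_limits(limits=1, user_api="blas"):
    if (pid := os.fork()) != 0:
        np.linalg.inv(A)
        os.waitpid(pid, 0)
    else:
        os._exit(0)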

Yes, I'm also thinking that it is the atfork handler that causes the deadlock - it uses the same (changed in PR5170) atomic ordering constraints as the main blas_server loop while it tries to shut down the thread pool.
And I guess a higher thread count simply translates to a higher probability that races/clashes occur (which would also be a main reason why the blatant lack of thread safety had no bearing on the early success of GotoBLAS/OpenBLAS: not much of a race when one can have at most 4 contestants).

Hmm, seems the hang occurs after the atfork handler has returned.

That makes sense to me. From the backtrace, it is making it into the inversion function. However, my working assumption is that the atfork handler is messing up the thread pool, which causes subsequent calls to deadlock.

I mentioned this issue on scipy/scipy#23686, which is the same problem on SciPy. @lesteve commented with a C reproducer there that only uses OpenBLAS. I can confirm that the small C reproducer hangs on Ubuntu 24.04 x86_64 just like in this issue with NumPy. Here is a shortened version of the reproducer.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* LAPACK prototype; the 64-bit integer arguments assume an ILP64 OpenBLAS
   build (with a 32-bit-int build, use int instead of int64_t). */
extern void dgetrf_(int64_t *m, int64_t *n, double *a, int64_t *lda,
                    int64_t *ipiv, int64_t *info);

int main(void) {
    int64_t m = 200, n = 200;
    int64_t lda = m;
    int64_t info;
    int64_t ipiv[200];

    // array is an identity matrix
    double arr[200 * 200] = {0};
    for (int i = 0; i < m * n; i += n + 1) {
        arr[i] = 1.0;
    }

    printf("before fork\n");
    pid_t pid = fork();
    printf("after fork\n");
    if (pid == 0) {
        printf("inside child\n");
        exit(0);
    } else {
        wait(NULL);
    }

    printf("before dgetrf\n");
    dgetrf_(&m, &n, arr, &lda, ipiv, &info);
    printf("after dgetrf\n");

    return 0;
}


When I add some printing at various calls inside OpenBLAS, I see:

installing atfork handler in memory::openblas_fork_handler 2015
in blas_thread_init
before fork
in blas_thread_shutdown
after fork
after fork
inside child
in blas_thread_shutdown
before dgetrf
in blas_thread_init
<hangs>

Is it expected to see the additional call to blas_thread_shutdown? (Edit: yes, that is a consequence of the child calling exit().) In any case, the hang seems to be in blas_thread_init, at the call to LOCK_COMMAND(&server_lock).

It would be interesting to know the state of server_lock at the start (so when blas_thread_init is called for the very first time), before atfork, and post atfork. If the atfork handler is cleaning everything up before we fork (so bringing down the thread pool) then the variables should be reset to their initial states. I am guessing this is not happening and this is what is causing us to deadlock.

Let's move further discussion to OpenMathLib/OpenBLAS#5520

Given the issue now appears to be fixed on the OpenBLAS side, what are the next steps? Is it worth reverting the OpenBLAS version used for the NumPy builds in the interim, whilst we wait for a new OpenBLAS release?

mattip commented

My plan is to finish this out in the coming week. If I can't get MacPython/openblas-libs#227 (which includes the upstream fix) to pass cleanly, I will patch the (edit: current) scipy-openblas packages to include the fix. Once there are fixed scipy-openblas packages, it will take something like #30049 to use them here, and something similar for SciPy.