OpenMathLib/OpenBLAS

Deadlock after fork when calling dgetrf_


As described in numpy/numpy#30092 and scipy/scipy#23686, there is a deadlock in OpenBLAS when calling dgetrf_ after a fork. I instrumented the calls to LOCK_COMMAND and UNLOCK_COMMAND in blas_server.c, and I think the problem is in exec_blas_async. This is "new" after #5170.
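(The instrumentation is just debug prints inside the lock macros, roughly like the sketch below; the real LOCK_COMMAND/UNLOCK_COMMAND definitions in blas_server.c vary by platform, so take this as an assumption, not the actual code.)

// hypothetical debug wrappers, not the real macros; need <pthread.h>/<stdio.h>
#define LOCK_COMMAND(lock) do { \
    pthread_mutex_lock(lock); \
    fprintf(stderr, "in %s %d server_lock locked\n", __func__, __LINE__); \
} while (0)

#define UNLOCK_COMMAND(lock) do { \
    pthread_mutex_unlock(lock); \
    fprintf(stderr, "in %s %d server_lock unlocked\n", __func__, __LINE__); \
} while (0)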

Here is the main() of the test code:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/wait.h>

// ILP64 LAPACK prototype, matching the int64_t arguments used below
extern void dgetrf_(int64_t *m, int64_t *n, double *a, int64_t *lda,
                    int64_t *ipiv, int64_t *info);

int main() {
    int64_t m = 200, n = 200;
    int64_t lda = m;
    int64_t info;
    int64_t ipiv[200];

    // arr is an identity matrix: zero everything, then set the diagonal
    double arr[200*200] = {0.0};
    for (int i = 0; i < m*n; i += n + 1) {
        arr[i] = 1.0;
    }

    printf("before fork\n");
    pid_t pid = fork();
    printf("after fork\n");
    if (pid == 0) {
        printf("inside child\n");
        exit(0);
    } else {
        wait(NULL);
    }

    printf("before dgetrf\n");
    dgetrf_(&m, &n, arr, &lda, ipiv, &info);
    printf("after dgetrf\n");
    return 0;
}
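Something like the following should build it (assuming an INTERFACE64=1 OpenBLAS installed as plain libopenblas; the file name repro.c is just for illustration):

cc -g repro.c -o repro -lopenblas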

Here is what I see with debug printing (on OpenBLAS HEAD, using ``):

installing atfork handler in memory::openblas_fork_handler 2015
in blas_thread_init 565
in blas_thread_init 567 server_lock locked
in blas_thread_init 615
in blas_thread_init 623
in blas_thread_init 626 server_lock unlocked
before fork
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
after fork
after fork
inside child
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
before dgetrf
in exec_blas_async 644
in exec_blas_async 647 server_lock locked
in blas_thread_init 565

Note the call to LOCK_COMMAND in exec_blas_async, and then the call to blas_thread_init, which tries to take server_lock again with another LOCK_COMMAND. The lock is not recursive, so the second acquisition blocks forever. Boom. This is the relevant code in exec_blas_async:

#ifdef SMP_SERVER
// Handle lazy re-init of the thread-pool after a POSIX fork
LOCK_COMMAND(&server_lock);
if (unlikely(blas_server_avail == 0)) blas_thread_init();
UNLOCK_COMMAND(&server_lock);
#endif
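To make the failure mode concrete: server_lock is evidently not recursive (otherwise we would not hang), so re-taking it from the same thread blocks forever. Here is a standalone pthreads illustration of the same pattern, not OpenBLAS code:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t server_lock = PTHREAD_MUTEX_INITIALIZER;

// stands in for blas_thread_init, which takes the lock itself
static void thread_init(void) {
    pthread_mutex_lock(&server_lock);   // second acquisition: blocks forever
    pthread_mutex_unlock(&server_lock);
}

int main(void) {
    pthread_mutex_lock(&server_lock);   // the exec_blas_async side
    printf("lock taken, calling init\n");
    thread_init();                      // never returns
    pthread_mutex_unlock(&server_lock);
    printf("never reached\n");
    return 0;
}

Built with cc demo.c -lpthread, this should print the first line and then hang, exactly like the reproducer.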
BLASLONG i = 0;

I am not sure what the best way to solve this is. Note that the first thing blas_thread_init does is check blas_server_avail (without taking the lock), so maybe the lock/unlock in exec_blas_async should simply be removed, as sketched below?
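Concretely, removing them would leave just the unlocked fast-path check (a sketch, not a tested patch):

#ifdef SMP_SERVER
// Handle lazy re-init of the thread-pool after a POSIX fork.
// blas_thread_init checks blas_server_avail on entry and takes
// server_lock itself, so no lock is needed here.
if (unlikely(blas_server_avail == 0)) blas_thread_init();
#endif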

In trying to understand the case that motivated the lock/unlock in exec_blas_async from #5170, I see it fixes #5104, #5147, and parts of #5153. I can understand the change for the atomic calls; did adding the lock/unlock also fix something? If so, maybe we could add an argument, blas_thread_init(int locking_needed), and pass in false when calling from exec_blas_async.
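That version might look roughly like this (hypothetical signature and body; the real blas_thread_init does much more than shown):

int blas_thread_init(int locking_needed) {
  if (blas_server_avail) return 0;            // unlocked fast path, as today
  if (locking_needed) LOCK_COMMAND(&server_lock);
  if (blas_server_avail == 0) {
    // ... spawn the worker threads, set blas_server_avail = 1 ...
  }
  if (locking_needed) UNLOCK_COMMAND(&server_lock);
  return 0;
}

exec_blas_async, which already holds server_lock, would call blas_thread_init(0); all other callers would pass 1.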

See #5479 - this was a brain fart, solely based on a valgrind warning. ISTR (I seem to recall) that reverting it did not fix the scipy issue.

mattip commented

Removing the locks does solve the reproducer here and in numpy/numpy#30092 on Ubuntu 24.04. The scipy issue does not reproduce for me with latest scipy HEAD on Linux; maybe I need to try it on macOS?

mattip commented

In the scipy issue there is a backtrace with the telltale exec_blas_async calling blas_thread_init, which must hang in the current code when using pthreads. I don't know about the atomic* parts of #5479, but removing the locks seems prudent.

frame #3: 0x000000012535561a libscipy_openblas.dylib`blas_thread_init + 42
frame #4: 0x0000000125355ab0 libscipy_openblas.dylib`exec_blas_async + 336