Deadlock after fork when calling dgetrf_
As described in numpy/numpy#30092 and scipy/scipy#23686, there is a deadlock in OpenBLAS when calling dgetrf_ after a fork. I instrumented the calls to LOCK_COMMAND and UNLOCK_COMMAND in blas_server.c and I think the problem is in exec_blas_async. This is "new" after #5170.
Here is the test code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/wait.h>

extern void dgetrf_(int64_t *m, int64_t *n, double *a, int64_t *lda,
                    int64_t *ipiv, int64_t *info);

int main(void) {
    int64_t m = 200, n = 200;
    int64_t lda = m;
    int64_t info;
    int64_t ipiv[200];
    // arr is an identity matrix: zero-initialize, then set the diagonal
    double arr[200 * 200] = {0};
    for (int i = 0; i < m * n; i += n + 1) {
        arr[i] = 1.0;
    }
    printf("before fork\n");
    pid_t pid = fork();
    printf("after fork\n");
    if (pid == 0) {
        printf("inside child\n");
        exit(0);
    } else {
        wait(NULL);
    }
    printf("before dgetrf\n");
    dgetrf_(&m, &n, arr, &lda, ipiv, &info);
    printf("after dgetrf\n");
    return 0;
}
```
and here is what I see with debug printing (on OpenBLAS HEAD, using ``):

```
installing atfork handler in memory::openblas_fork_handler 2015
in blas_thread_init 565
in blas_thread_init 567 server_lock locked
in blas_thread_init 615
in blas_thread_init 623
in blas_thread_init 626 server_lock unlocked
before fork
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
after fork
after fork
inside child
in blas_thread_shutdown
in blas_thread_shutdown 1000 server_lock locked
in blas_thread_shutdown 1042 server_lock unlocked
before dgetrf
in exec_blas_async 644
in exec_blas_async 647 server_lock locked
in blas_thread_init 565
```
Note the call to LOCK_COMMAND in exec_blas_async, and then the call to blas_thread_init, which calls LOCK_COMMAND again on the server_lock the same thread already holds. Boom.
(see OpenBLAS/driver/others/blas_server.c, lines 638 to 644 at 0c59ae0)
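To spell out why this hangs: after the fork handler runs blas_thread_shutdown, blas_server_avail is 0 again, so exec_blas_async calls blas_thread_init while already holding server_lock, and the second LOCK_COMMAND never returns. Here is a minimal standalone sketch of that pattern, with hypothetical stand-in names rather than the actual blas_server.c code; built with pthreads it hangs at the second lock just like the reproducer:

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t server_lock = PTHREAD_MUTEX_INITIALIZER;
static int blas_server_avail = 0;  /* reset by the shutdown path after fork */

static void thread_init(void) {
    if (blas_server_avail) return;      /* unlocked fast path, as in blas_thread_init */
    pthread_mutex_lock(&server_lock);   /* same thread locks again: deadlock */
    blas_server_avail = 1;              /* ... worker-thread setup would go here ... */
    pthread_mutex_unlock(&server_lock);
}

static void exec_async(void) {
    pthread_mutex_lock(&server_lock);   /* the lock added in #5170 */
    if (!blas_server_avail)
        thread_init();                  /* never returns */
    pthread_mutex_unlock(&server_lock);
}

int main(void) {
    printf("calling exec_async\n");
    exec_async();                       /* hangs here on a non-recursive mutex */
    printf("never reached\n");
    return 0;
}
```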
I am not sure what the best way to solve this is. Note that the first thing blas_thread_init does is check blas_server_avail (with no lock), so maybe the lock/unlock in exec_blas_async should simply be removed?
In trying to understand the case that motivated the lock/unlock in exec_blas_async in #5170, I see it fixes #5104, #5147, and parts of #5153. I can understand the change to the atomic calls, but did adding the lock/unlock also fix something on its own? If so, maybe we could add an argument, blas_thread_init(int locking_needed), and pass in false when called from exec_blas_async.
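Purely as a sketch of that suggestion (the names mirror blas_server.c, but this is untested and not a patch):

```c
/* Sketch only: today blas_thread_init takes no arguments. The flag lets a
   caller that already holds server_lock skip the lock/unlock. */
int blas_thread_init(int locking_needed) {
    if (blas_server_avail) return 0;
    if (locking_needed) LOCK_COMMAND(&server_lock);
    /* ... existing setup, which ends by setting blas_server_avail = 1 ... */
    if (locking_needed) UNLOCK_COMMAND(&server_lock);
    return 0;
}
```

exec_blas_async already holds server_lock at its call site, so it would pass 0; every other caller would keep the current behavior with blas_thread_init(1).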
see #5479 - this was a brain fart based solely on a valgrind warning. ISTR reverting it did not fix the scipy issue.
Removing the locks does solve the reproducer here and the one in numpy/numpy#30092 on Ubuntu 24.04. The scipy issue does not reproduce for me with latest scipy HEAD on Linux; maybe I need to try it on macOS?
In the scipy issue there is a backtrace with the telltale exec_blas_async calling blas_thread_init, which must hang in the current code when using pthreads:

```
frame #3: 0x000000012535561a libscipy_openblas.dylib`blas_thread_init + 42
frame #4: 0x0000000125355ab0 libscipy_openblas.dylib`exec_blas_async + 336
```

I don't know about the atomic* parts of #5479, but removing the locks seems prudent.