OpenMathLib/OpenBLAS

Non-deterministic output with multiple OpenMP runtimes in the same process

lightsighter opened this issue · 21 comments

I observe non-deterministic output when calling cblas_sgemv concurrently from two different threads, each of which is bound to a different copy of an OpenMP runtime in the same process. I built OpenBLAS 0.3.5 with USE_OPENMP=1, USE_THREAD=1, NUM_PARALLEL=2, and NUM_THREADS=40. Each copy of the OpenMP runtime has 20 threads in its thread pool, and I call openblas_set_num_threads(20) once before doing any other OpenBLAS calls (which, with 2 copies of the OpenMP runtime, gives 40 total threads, the same as NUM_THREADS). Everything is deterministic if I take care to ensure that one thread completes its call to cblas_sgemv before the second thread makes its call. However, if both threads call into cblas_sgemv concurrently, each with its own OpenMP runtime, the output is non-deterministic. My guess is that one or more global variables somewhere in OpenBLAS are suffering from a race under concurrent invocations, but I've so far been unable to find them in the source.
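In Makefile terms, the build was roughly:

```
make USE_OPENMP=1 USE_THREAD=1 NUM_PARALLEL=2 NUM_THREADS=40
```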

Note that you can't reproduce this with any of the standard OpenMP runtimes (e.g. the ones shipped by GCC, LLVM, or Intel), as they all have global variables that prevent multiple copies of the runtime from existing concurrently in the same process. I'm using the OpenMP implementation provided by the Realm runtime, which does support concurrent copies of the OpenMP runtime in the same process:
https://github.com/StanfordLegion/legion/blob/stable/runtime/realm/openmp/openmp_api.cc

I realize this is a very unusual use case for you to consider, but I'm hoping that it's an easy fix to just get rid of one or a few global variables somewhere.

This is probably a variation of #1851 - sadly, there is more than just a single global variable involved in this design limitation inherited from the original GotoBLAS. My crude workaround in #1875 left out (conventional) OpenMP on purpose, as that appeared to do a good enough job of avoiding conflicts. You could build OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1 to avoid the problematic multithreaded GEMM path altogether, or try removing the #ifndef USE_OPENMP in level3_thread.c to block accesses from the second instance until the first has completed its call.
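For the former, with your other settings that would be roughly:

```
make USE_OPENMP=1 USE_THREAD=1 NUM_PARALLEL=2 NUM_THREADS=40 USE_SIMPLE_THREADED_LEVEL3=1
```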

Thanks @martin-frbg for the information. Unfortunately I really do need these calls to run concurrently and with OpenMP for performance reasons so serializing them inside OpenBLAS is not a reasonable option. Are there plans to eliminate these global variables eventually in OpenBLAS?

Depends on your definition of "eventually" - this project has only a small number of semi-regular contributors, so PRs are always welcome. Have you checked performance with USE_SIMPLE_THREADED_LEVEL3? That should still employ some parallelism.

There is just no way two OpenMP runtimes whose mutex functions carry the same names could ever work in one process. Your faulty setup is doomed to fail with or without OpenBLAS.

@brada4 Read my full post again more carefully. I'm not using one of the normal implementations of the OpenMP runtime. I have a special one that supports concurrent instances of the OpenMP runtime in the same process.

@martin-frbg I see calls into OpenMP being serialized as you describe, and the results are deterministic. Performance is 2X less than it should be, which for this particular workload is unacceptable. Pretty much the whole application consists of concurrent BLAS calls.

Is using the non-OMP threading (USE_OPENMP=0) an option for you?

@lightsighter both iomp and clang's omp emulate gomp symbols so they can act as drop-in replacements. The library you point to just permits nesting KMP with OMP; you could nest native threads within any OpenMP implementation, as @TiborGY points out.

@TiborGY using OpenMP is a requirement for me for performance reasons.

@brada4 Your statement that "the library you point to just permits nesting KMP with OMP" is incorrect. That is not at all how the OpenMP library I'm using works, nor its purpose. I also can't do lots of tiny BLAS calls on different threads, for performance reasons. I need lots of threads cooperating on just a few very large BLAS operations; any attempt to decompose them will degrade performance considerably.

You could try removing the "!defined(USE_OPENMP)" from the #ifdefs around locking calls in memory.c, in particular between lines 2700 and 2800 (those seemed to be the crucial sections in the somewhat related #2126).
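The guards in question follow roughly this pattern (paraphrased, not an exact copy of memory.c):

```c
/* paraphrased pattern from memory.c -- not an exact copy */
#if defined(SMP) && !defined(USE_OPENMP)
  LOCK_COMMAND(&alloc_lock);    /* protects the global buffer allocation table */
#endif
  /* ... search/update the shared memory table ... */
#if defined(SMP) && !defined(USE_OPENMP)
  UNLOCK_COMMAND(&alloc_lock);
#endif
```

Dropping the !defined(USE_OPENMP) part makes the lock be taken in OpenMP builds as well.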

@TiborGY using OpenMP is a requirement for me for performance reasons.

I think you might have misunderstood what I tried to say. I am not saying you should stop using OpenMP in your program, but rather that you should compile OpenBLAS with USE_OPENMP=0 and USE_THREAD=1. This will use native threading inside OpenBLAS instead of the OMP threading.
The last time I tested the various threading types on real-world problems I use OpenBLAS for (QR factorization of approx. 10 000 x 50 000 matrices), the OMP threading was significantly slower than the native threading.
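I.e. build OpenBLAS along these lines:

```
make USE_OPENMP=0 USE_THREAD=1 NUM_THREADS=40
```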

Just noting that you propose a similar, working idea, close to the library in the OP, which in turn wraps its own parallel sections inside OMP parallel ones. I looked into it for signs of weird namespace masking/aliasing and found none.

@lightsighter unfortunately I'm having some problems with setting up Legion for testing - I'm trying to build on the provided omp_saxpy example, but that one already fails to run, with an error message from default_mapper about failing to find any valid variants for the task. Apparently it cannot figure out the presence of the CPU - does this need any additional build- or runtime configuration? I did not notice anything about it in the documentation.

Yes, unfortunately the runtime under Legion (called Realm) still needs you to tell it how to configure the machine. In this case you'll want to use the -ll:ocpu <N> command line option to say how many OpenMP processors you want to create, and the -ll:othr <N> option to say how many threads to create in the thread pool of each of those OpenMP processors.
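For example, to create two OpenMP processors with 20 threads each (matching my earlier description):

```
./omp_saxpy -ll:ocpu 2 -ll:othr 20
```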

I might also recommend that you talk to @marcinz from the cunumeric project, as I know they have some tests where they had to disable running with multiple OpenMP processors in Legion because of this issue, and they can probably give you a very simple command line that will do some BLAS computations using multiple OpenMP processors.

Thanks - indeed @marcinz, if you could point me to some test you had to xfail because of problems in OpenBLAS, that would be very helpful. Running cunumeric's test.py without modifications did not bring up any obvious differences between running with --omp 2 and without, but I am not familiar with your code.

I suspect none of the tests in the test suite are actually big enough to exercise the bug. Legate's default partitioning heuristics won't over-partition the input arrays to span multiple OpenMP processors unless it is profitable to do so. You can force them to always decompose data across as many processors as are available by running with LEGATE_TEST=1 in your environment. Even then the failure mode is non-deterministic and rare: it will look like non-deterministically getting the wrong answer, with no crashes, segfaults, aborts, or anything obvious like that. The tasks need to be big enough that they're both executing concurrently on the different OpenMP processors, which is a bit difficult to orchestrate at smaller scales.
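For example, something like (assuming the same test.py invocation you used above):

```
LEGATE_TEST=1 ./test.py --omp 2
```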

Thanks for the LEGATE_TEST=1 - I'll just try looping the test for now until something breaks and/or running on a really big machine (provided I can get it to run on arm64 or power9 server hardware).

I have a standalone OpenMP program which runs multithreaded OpenBLAS DGEMM under 4 concurrent pthreads, and it produced the correct flops and results. I compiled OpenBLAS with USE_OPENMP=1, NO_AFFINITY=0, NUM_PARALLEL=4. The OpenBLAS version I used is bf3183d from Oct 10, 2023.
omp_dgemm.txt
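For readers without the attachment, the reproducer has roughly this shape (sizes and thread counts here are placeholders; the attached omp_dgemm.txt is the authoritative version):

```c
/* Sketch of a pthread + OpenBLAS DGEMM reproducer; build OpenBLAS with
   USE_OPENMP=1 NUM_PARALLEL=4 and link with -lopenblas -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

#define NTHREADS 4
#define N 2048

static void *worker(void *arg) {
  int id = *(int *)arg;
  double *a = malloc(sizeof(double) * N * N);
  double *b = malloc(sizeof(double) * N * N);
  double *c = malloc(sizeof(double) * N * N);
  for (long i = 0; i < (long)N * N; i++) { a[i] = 1.0; b[i] = 1.0; c[i] = 0.0; }
  /* Give each caller a slice of the shared pool (placeholder value). */
  openblas_set_num_threads(5);
  /* Each pthread issues its own multithreaded DGEMM; with a single
     conventional OpenMP runtime these calls share one thread pool. */
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, N, N, N,
              1.0, a, N, b, N, 0.0, c, N);
  printf("thread %d: c[0] = %f (expected %d)\n", id, c[0], N);
  free(a); free(b); free(c);
  return NULL;
}

int main(void) {
  pthread_t threads[NTHREADS];
  int ids[NTHREADS];
  for (int i = 0; i < NTHREADS; i++) {
    ids[i] = i;
    pthread_create(&threads[i], NULL, worker, &ids[i]);
  }
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(threads[i], NULL);
  return 0;
}
```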

I'll just try looping the test for now until something breaks and/or running on a really big machine (provided I can get it to run on arm64 or power9 server hardware).

Make sure you increase the problem size as well. I run Legate pretty regularly on my M1 so it should work on arm64. I know there are folks running Legion on Power9 (Summit and Sierra), but I also know for certain that Legate does not have CI for PowerPC at the moment.

I have a standalone OpenMP program which runs multithreaded OpenBLAS DGEMM under 4 concurrent pthreads,

@eddy16112 I don't think that is sufficient to reproduce the issue. You've got four threads, but only one copy of the OpenMP runtime in the process. I know you're setting the number of OpenMP threads inside each pthread to a smaller value than the total number of threads in the thread pool, but I still think that single OpenMP runtime will serialize the execution of those loops in some order. You need multiple copies of the OpenMP runtime in the same process to be capable of executing multiple loop bodies in parallel.

I am not sure how to create multiple OpenMP runtime instances; the OpenMP standard does not define this. The goal of my reproducer is to make sure that we are able to run multiple OpenBLAS DGEMMs (or other kernels) concurrently, e.g. one DGEMM per NUMA node, with all the DGEMMs producing the correct flops (the flops of a sequential DGEMM times the number of cores used by each DGEMM).

but I still think that single OpenMP runtime will serialize the execution of those loops in some order.

I do not think so; as I said, all the DGEMMs within each pthread achieve the correct performance.

Make sure you increase the problem size as well.

OK, is there an easy way to do this? Remember I am not at all familiar with cunumeric - so far I'm only invoking cunumeric.test().

I run Legate pretty regularly on my M1 so it should work on arm64.

OK, will try that then - running on a 12-core x86_64 did not bring up anything so far, and my build attempt on power9 hit a known problem in tblis where it includes the x86 immintrin.h unconditionally (I failed to work around this, as feeding the location of a "fixed" tblis to cunumeric's install.py does not appear to work - it was still unpacking a pristine copy for its build).

I am not sure how to create multiple OpenMP runtime instances; the OpenMP standard does not define this.

Right, that is quite a non-standard implementation (if not to say extension) of OpenMP... (I also found some rather strongly-worded comments to this effect in an open issue ticket in tblis in the meantime.) I have some experimental code that duplicates all the internal queue structures NUM_PARALLEL times, which I think would remove any potential conflicts, but it would be a rather invasive and ugly change that would increase the memory footprint considerably.
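The rough shape of that experiment (identifiers here are illustrative only, not the actual OpenBLAS internals):

```c
/* Conceptual sketch only -- illustrative names, not the real OpenBLAS code. */
#define NUM_PARALLEL   2    /* number of concurrent callers to support */
#define MAX_CPU_NUMBER 40   /* threads served per instance's queue set */

typedef struct { /* ... per-task arguments, sync flags ... */ int dummy; } queue_t;

/* One full set of queue structures per potential concurrent caller,
   instead of a single process-global set -- hence the larger footprint. */
static queue_t queues[NUM_PARALLEL][MAX_CPU_NUMBER];
static volatile int slot_busy[NUM_PARALLEL];

/* On entry to a threaded BLAS call, claim a free slot with an atomic
   test-and-set; all further queue accesses then use queues[slot][...]. */
static int claim_slot(void) {
  for (;;)
    for (int i = 0; i < NUM_PARALLEL; i++)
      if (__sync_bool_compare_and_swap(&slot_busy[i], 0, 1))
        return i;
}

static void release_slot(int i) { __sync_lock_release(&slot_busy[i]); }
```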