Issues when calling cblas_sgemm from multiple threads
crayroud-nviso opened this issue · 20 comments
When calling cblas_sgemm in parallel on different threads, I sometimes get wrong output results.
In my application I have two threads and use OpenBLAS with threading disabled (USE_THREAD OFF, NUM_THREADS 0).
I can easily test this by running the same gemm twice (in the same thread) and verifying that the results are the same.
cblas_sgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C1, ldc);
cblas_sgemm(layout, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C2, ldc);
// Compare C1 and C2
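The comparison step can be made quantitative. Here is a minimal sketch of such a check; it uses a naive reference multiply as a stand-in for cblas_sgemm so it runs anywhere, and the helper names (`ref_sgemm`, `max_abs_diff`) are illustrative, not from the original code:

```c
#include <math.h>

/* Naive row-major C = alpha*A*B + beta*C, standing in for cblas_sgemm
   (illustrative only; the real test would call the BLAS routine). */
static void ref_sgemm(int M, int N, int K, float alpha,
                      const float *A, int lda,
                      const float *B, int ldb,
                      float beta, float *C, int ldc)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++)
                acc += A[i*lda + k] * B[k*ldb + j];
            C[i*ldc + j] = alpha * acc + beta * C[i*ldc + j];
        }
}

/* Largest absolute element-wise difference between two M x N results;
   a nonzero value here flags the mismatch described in this issue. */
static float max_abs_diff(const float *C1, const float *C2,
                          int M, int N, int ldc)
{
    float worst = 0.0f;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float d = fabsf(C1[i*ldc + j] - C2[i*ldc + j]);
            if (d > worst) worst = d;
        }
    return worst;
}
```

Comparing against a maximum difference (rather than a sum of squares) makes it easy to distinguish "huge" corruption from accumulated rounding.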
Sometimes the results are different when using multiple threads on different data. With a single thread the comparison always succeeds.
The problem looks similar to #1755, but it still occurs on version 0.3.6, with and without USE_TLS.
The difference is huge, not a minor rounding error.
I have not managed to reproduce it yet with a simple example.
Which CPU and operating system is this ? I guess you could try compiling OpenBLAS with USE_SIMPLE_THREADED_LEVEL3=1, and/or try how the test case from #1755 behaves on your system. If your OS is Linux, perhaps you could also try running your code from valgrind to see if that flags any thread races (unless you are using OpenMP, which confuses the tool) or invalid memory accesses.
Just going by your username, if you are building on ARMV8 with CMAKE this could also be something similar to #1870 (although that one should be fixed as well)
The error occurs on x86_64, running Ubuntu as the operating system.
That should at least exclude #1870 - at some point it could be useful to know the cpu model as the choice of optimized gemm kernel depends on it. Have you checked that your program behaves correctly with the reference BLAS (or any other like ATLAS) just to exclude some subtle problem elsewhere in your application ?
CPU model: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
I have tested with BLIS and MKL; there is no error.
With the sample code from #1755 (adding the missing * to make it build) and without the printf(".") I get errors. Note that I see no errors when the printf(".") adds a delay.
Using the sample code, with and without USE_SIMPLE_THREADED_LEVEL3 and USE_TLS, I get errors.
Valgrind reports no errors, and the test does not fail under it, presumably because valgrind adds a large delay.
Hmm. 5820 is regular HASWELL kernel. At least we seem to have a simple reproducer now thanks to #1755 (pity about the printf actually hiding the problem). valgrind (with its default tool memcheck) at least suggests that there are no out-of-bounds accesses, next would be to run it with tool=helgrind to do an even slower check for race conditions. (Not saying that the burden is on you, just describing a workflow.)
Valgrind with --tool=helgrind lists multiple possible race conditions when running the sample code.
Here is one:
==22541== Possible data race during write of size 4 at 0x583B00C by thread #9
==22541== Locks held: none
==22541== at 0x40C9C2: sgemm_incopy (in /devel/openblas/matrix-test)
==22541== by 0x402A32: sgemm_tn (in /devel/openblas/matrix-test)
==22541== by 0x401C16: cblas_sgemm (in /devel/openblas/matrix-test)
==22541== by 0x40186F: matrixrunner(void*) (in /devel/openblas/matrix-test)
==22541== by 0x4C34847: mythread_wrapper (hg_intercepts.c:389)
==22541== by 0x4E466B9: start_thread (pthread_create.c:333)
==22541== Address 0x583b00c is in a rw- anonymous segment
#1851 could also be related (although either its fix or USE_SIMPLE_THREADED_LEVEL3 should both have taken care of this)
Interestingly I cannot seem to reproduce the problem with the (corrected) test case from issue 1755 on my i7-8700K (also a hexacore, same HASWELL target as far as OpenBLAS is concerned).
(This was using current develop, gmake build with defaults except DEBUG=1 to get line number information in backtraces.)
Helgrind does not show any problems either, except some initial unguarded writes to the "threads_ready" and "go" variables of the test case itself.
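Those unguarded writes in the test case itself are straightforward to fix. A minimal sketch of a start gate built on a pthread condition variable, which could replace the busy-waits on `threads_ready` and `go` (the type and function names here are illustrative, not from the original test case):

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative start gate: workers block in gate_wait() until the main
   thread calls gate_open(); every access to ready/open happens under
   the mutex, so helgrind has nothing left to flag. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int ready;   /* number of workers that have arrived */
    int open;    /* set once by the opener */
} start_gate;

static void gate_init(start_gate *g)
{
    pthread_mutex_init(&g->mu, NULL);
    pthread_cond_init(&g->cv, NULL);
    g->ready = 0;
    g->open  = 0;
}

static void gate_wait(start_gate *g)          /* called by each worker */
{
    pthread_mutex_lock(&g->mu);
    g->ready++;
    pthread_cond_broadcast(&g->cv);           /* wake an opener waiting on ready */
    while (!g->open)
        pthread_cond_wait(&g->cv, &g->mu);
    pthread_mutex_unlock(&g->mu);
}

static void gate_open(start_gate *g, int nworkers)  /* called by main */
{
    pthread_mutex_lock(&g->mu);
    while (g->ready < nworkers)
        pthread_cond_wait(&g->cv, &g->mu);
    g->open = 1;
    pthread_cond_broadcast(&g->cv);
    pthread_mutex_unlock(&g->mu);
}

static void *gate_demo_worker(void *arg)      /* tiny demo worker */
{
    gate_wait((start_gate *)arg);
    return NULL;
}
```

This changes the warnings helgrind emits, not the gemm race itself, but it helps separate test-harness noise from the real OpenBLAS problem.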
Are you certain that your test codes are actually picking up the 0.3.6 libopenblas.so at runtime, or
could there be an older version in the library search path ? (It would have to be older than 0.3.4 in order to lack the fix for #1851)
Yes I'm using the version 0.3.6.
Can you reproduce the issue with the sample code without DEBUG=1? Timings seem important to reproduce the issue.
That occurred to me as well, but I still cannot reproduce the issue even without the DEBUG=1. Perhaps compiler version plays a role as well ?
I used this compiler to build the sample code.
g++ --version
g++ (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
7.2.1 on my system, but if anything I would have expected compiler-related problems to show up with gcc 8 or 9 (whose more aggressive reuse of registers has uncovered a few bugs in the assembly kernels in the past). This is now beginning to look more like a hardware or firmware issue again.
Varying the number of threads and cycles has not caused the test to go bad either.
Have you compiled with USE_THREAD OFF and NUM_THREADS 0 as I did ?
What do you mean by "hardware or firmware issue again" ?
Can I do some more tests to give you more information ?
Good point, seems I got fixated on #1851 (and now I am not even sure if I had noticed the USE_THREAD=0 in #1755, based on the comments I made back then...).
This could get interesting - it is likely that many locks in OpenBLAS are only used (only compiled in) when the library itself is built with multithreading support. (In that case, a workaround could be to build with USE_THREAD=ON NUM_THREADS=1.)
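If that is the case, the suggested workaround would amount to something like the following (a hedged sketch, assuming a gmake build from the OpenBLAS source tree):

```shell
# Build OpenBLAS with the threading infrastructure (and its locks)
# compiled in, but capped at a single thread.
make USE_THREAD=1 NUM_THREADS=1
```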
Same issue with USE_THREAD=ON and NUM_THREADS=1
To be sure here is how I changed the sample code #1755 to build and to reproduce the issue, is it the same as yours ?
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <math.h>
#include <pthread.h>
#include <stdlib.h>
#ifdef HAVE_MKL
#include "mkl.h"
#else
extern "C" {
#include "cblas.h"
}
#endif
int loop_max=10000;
float *Adata, *Bdata, *Mdata_ans, *Mdata_prime;
float alpha=1.0, beta=1.0;
int a_num_rows=1;
int a_num_cols=3000;
int a_stride=3000;
int b_num_rows=128;
int b_num_cols=3000;
int b_stride=3000;
int num_rows=1;
int num_cols=128;
int stride=128;
CBLAS_TRANSPOSE transA=CblasNoTrans, transB=CblasTrans;
int threads_ready=0;
pthread_mutex_t locker=PTHREAD_MUTEX_INITIALIZER;
bool go=false;
int main(int argc, char *argv[])
{
    printf("2\n");
#ifdef HAVE_MKL
    printf("MKL\n");
    mkl_set_num_threads(1);
#else
    printf("Openblas\n");
    openblas_set_num_threads(1);
#endif
    int thread_count=8;
    if (argc > 1) {
        int n=atoi(argv[1]);
        if (n >= 0) thread_count=n;
        if (argc > 2) {
            n=atoi(argv[2]);
            if (n >= 1) loop_max=n;
        }
    }
    printf("Max thread=%d cycles=%d\n", thread_count, loop_max);
    Adata=(float*)malloc(a_num_rows * a_num_cols * sizeof(float));
    Mdata_prime=(float*)malloc(num_rows * num_cols * sizeof(float));
    Bdata=(float*)malloc(b_num_rows * b_num_cols * sizeof(float));
    Mdata_ans=(float*)malloc(num_rows * num_cols * sizeof(float));
    for (int i=0; i<a_num_rows * a_num_cols; i++) Adata[i]=drand48();
    for (int i=0; i<num_rows * num_cols; i++) Mdata_prime[i]=drand48();
    for (int i=0; i<b_num_rows*b_num_cols; i++) Bdata[i]=drand48();
    memcpy(Mdata_ans, Mdata_prime, sizeof(float)* num_rows * num_cols);
    cblas_sgemm(CblasRowMajor, transA, transB, num_rows, num_cols,
                transA==CblasNoTrans ? a_num_cols : a_num_rows,
                alpha, Adata, a_stride, Bdata, b_stride, beta, Mdata_ans, stride);
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    pthread_t pyr[thread_count];
    struct timespec clock_start, clock_end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &clock_start);
    for (long i=0; i < thread_count; i++) {
        void *matrixrunner(void *thrdata);
        int r=pthread_create(&pyr[i], &attr, matrixrunner, (void *)(i+1));
        if (r != 0) {
            printf("Thread create failed %ld %d/%s\n", i, errno, strerror(errno));
        }
    }
    while (thread_count > threads_ready) {}
    go=true;
    for (int i=0; i < thread_count; i++) {
        void *status;
        pthread_join(pyr[i], &status);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &clock_end);
    printf("Duration %f seconds\n", clock_end.tv_sec-clock_start.tv_sec+((float)(clock_end.tv_nsec-clock_start.tv_nsec)/1e9));
    return(0);
}
void *matrixrunner(void* thrdata)
{
    printf("%ld ", (long)thrdata);fflush(stdout);
    int count=0;
    float* Mdata=(float*)malloc(num_rows*num_cols*sizeof(float));
    pthread_mutex_lock(&locker);
    threads_ready++;
    pthread_mutex_unlock(&locker);
    while (!go) {}
    while (count < loop_max) {
        memcpy(Mdata, Mdata_prime, sizeof(float)*num_rows*num_cols);
        cblas_sgemm(CblasRowMajor, transA, transB, num_rows, num_cols,
                    transA==CblasNoTrans ? a_num_cols : a_num_rows,
                    alpha, Adata, a_stride, Bdata, b_stride, beta, Mdata, stride);
        float lms=0;
        for (int i=0; i < num_rows*num_cols; i++) {
            lms+=pow(fabs(Mdata_ans[i]-Mdata[i]), 2);
        }
        if (lms > .1) {
            printf("BAD %.12f %ld (%d)\n", lms, syscall(SYS_gettid), count);
            fflush(stdout);
            _exit(0);
        }
        //printf(".");
        //fflush(stdout);
        count++;
    }
    free(Mdata);
    pthread_exit(NULL);
    return(NULL);
}
Yep, reproduced with USE_THREAD=0. It can be mitigated (the "BAD" result is gone, but lots of helgrind warnings remain) by removing the "#if defined(SMP)" from a bunch of locking calls between lines 2700 and 2800 of memory.c. Will see later if helgrind goes quiet when I remove the remaining "SMP" clauses, but there are also some potential conflicts between sgemm kernels that suggest the level3 code may need serializing similar to what level3_thread got in #1851.
Should be fixed by #2136 , which introduces a new option USE_LOCKING to be used in conjunction with USE_THREAD=0 for creating a single-threaded but thread-aware OpenBLAS. This avoids the (probable) performance penalty caused by the locking calls in use cases where OpenBLAS is not expected to be called from concurrent threads.
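For reference, a build using the new option might look like this (a hedged sketch; see #2136 for the authoritative flag semantics):

```shell
# Single-threaded OpenBLAS that is still safe to call
# concurrently from multiple application threads.
make USE_THREAD=0 USE_LOCKING=1
```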
Thanks for your support, it fixes my issue