OpenMathLib/OpenBLAS

0.3.28 zgemm/cgemm test failures on aarch64


I am trying to update openblas from 0.3.26 to 0.3.28 on Fedora and CentOS. On Fedora I get some test failures on aarch64, while on CentOS all tests pass. Both machines are Neoverse-N1, but the failing Fedora machine has 12 cores and the passing CentOS machine has 32 cores.

I saw in the 0.3.28 release notes that there were some aarch64-specific gemm changes, so the failures could be related to those.

hw_info_fedora_fail.log
build.log (too big to attach here directly)

TEST 1439/1526 zgemm:transa_conjnotransb [FAIL]
  ERR: test_extensions/test_zgemm.c:272  expected 0.000e+00, got 7.890e+00 (diff -7.890e+00, tol 1.000e-13)
TEST 1440/1526 zgemm:conjnotransa_transb [FAIL]
  ERR: test_extensions/test_zgemm.c:250  expected 0.000e+00, got 7.735e+00 (diff -7.735e+00, tol 1.000e-13)
TEST 1441/1526 zgemm:conjnotransa_conjnotransb [FAIL]
  ERR: test_extensions/test_zgemm.c:228  expected 0.000e+00, got 8.489e+00 (diff -8.489e+00, tol 1.000e-13)
TEST 1442/1526 zgemm:conjnotransa_notransb [FAIL]
  ERR: test_extensions/test_zgemm.c:206  expected 0.000e+00, got 7.586e+00 (diff -7.586e+00, tol 1.000e-13)
TEST 1443/1526 zgemm:conjnotransa_conjtransb [FAIL]
  ERR: test_extensions/test_zgemm.c:184  expected 0.000e+00, got 8.016e+00 (diff -8.016e+00, tol 1.000e-13)
TEST 1444/1526 zgemm:notransa_conjnotransb [FAIL]
  ERR: test_extensions/test_zgemm.c:162  expected 0.000e+00, got 7.245e+00 (diff -7.245e+00, tol 1.000e-13)
TEST 1445/1526 zgemm:conjtransa_conjnotransb [FAIL]
  ERR: test_extensions/test_zgemm.c:140  expected 0.000e+00, got 8.251e+00 (diff -8.251e+00, tol 1.000e-13)
TEST 1446/1526 cgemm:transa_conjnotransb [FAIL]
  ERR: test_extensions/test_cgemm.c:272  expected 0.000e+00, got 7.783e+00 (diff -7.783e+00, tol 1.000e-04)
TEST 1447/1526 cgemm:conjnotransa_transb [FAIL]
  ERR: test_extensions/test_cgemm.c:250  expected 0.000e+00, got 7.625e+00 (diff -7.625e+00, tol 1.000e-04)
TEST 1448/1526 cgemm:conjnotransa_conjnotransb [FAIL]
  ERR: test_extensions/test_cgemm.c:228  expected 0.000e+00, got 7.754e+00 (diff -7.754e+00, tol 1.000e-04)
TEST 1449/1526 cgemm:conjnotransa_notransb [FAIL]
  ERR: test_extensions/test_cgemm.c:206  expected 0.000e+00, got 8.025e+00 (diff -8.025e+00, tol 1.000e-04)
TEST 1450/1526 cgemm:conjnotransa_conjtransb [FAIL]
  ERR: test_extensions/test_cgemm.c:184  expected 0.000e+00, got 8.346e+00 (diff -8.346e+00, tol 1.000e-04)
TEST 1451/1526 cgemm:notransa_conjnotransb [FAIL]
  ERR: test_extensions/test_cgemm.c:162  expected 0.000e+00, got 8.060e+00 (diff -8.060e+00, tol 1.000e-04)
TEST 1452/1526 cgemm:conjtransa_conjnotransb [FAIL]
  ERR: test_extensions/test_cgemm.c:140  expected 0.000e+00, got 8.272e+00 (diff -8.272e+00, tol 1.000e-04)
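
All fourteen failures listed above (seven zgemm, seven cgemm) have a conjnotrans operand on at least one side. To check that pattern against the full build log, a minimal Python sketch along these lines could group the failing lines. It assumes only the `TEST n/m routine:variant [FAIL]` line format shown above; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "TEST 1439/1526 zgemm:transa_conjnotransb [FAIL]".
FAIL_RE = re.compile(r"^TEST\s+\d+/\d+\s+(\w+):(\w+)\s+\[FAIL\]")


def failing_tests(log_text):
    """Return (routine, variant) pairs for every failed test in the log."""
    found = []
    for line in log_text.splitlines():
        m = FAIL_RE.match(line.strip())
        if m:
            found.append((m.group(1), m.group(2)))
    return found


def summarize(log_text):
    """Count failures per routine and check whether every failing
    variant name contains 'conjnotrans'."""
    fails = failing_tests(log_text)
    per_routine = Counter(routine for routine, _ in fails)
    # Note: this is trivially True when the log contains no failures.
    all_conjnotrans = all("conjnotrans" in variant for _, variant in fails)
    return per_routine, all_conjnotrans
```

Feeding it the contents of build.log, e.g. `summarize(open("build.log").read())`, returns the per-routine failure counts and a flag saying whether the conjnotrans pattern holds for every failure.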

That is a bit suspicious; I don't think we had any recent changes in the complex GEMM kernels for the N1. Are you using the same compiler version on both? And can you get the CentOS build to fail by running openblas_utest_ext with OPENBLAS_NUM_THREADS=12?

@martin-frbg Yes, I got it to fail the same way on c10s by limiting openblas to 12 threads. The gcc version is the same, afaik.
Build log link, if you are interested: https://drive.google.com/file/d/1h5xDycLQUESGJjfPG0ewO8MH746fRqw_/view?usp=sharing

Reproduced on the new Ampere Altra in the GCC Compile Farm. Very odd: it only fails at 12, 13 and 24 threads. I'm more inclined to suspect an oddity in the test code itself...
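
For anyone who wants to repeat the thread-count sweep, a rough Python sketch could look like the one below. It sets OPENBLAS_NUM_THREADS for each run, as suggested earlier in the thread, and records the exit code. The helper name, the binary path in the usage comment, and the assumption that the test binary exits non-zero when a test fails are all mine and should be checked against the local build.

```python
import os
import subprocess


def sweep_threads(cmd, thread_counts):
    """Run cmd once per thread count with OPENBLAS_NUM_THREADS set.

    Returns a dict mapping each thread count to the process exit code.
    """
    results = {}
    for n in thread_counts:
        # Copy the current environment and override only the thread count.
        env = dict(os.environ, OPENBLAS_NUM_THREADS=str(n))
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[n] = proc.returncode
    return results


# Example usage (the path is an assumption, adjust to the build tree):
#   codes = sweep_threads(["./openblas_utest_ext"], range(1, 33))
#   print([n for n, rc in codes.items() if rc != 0])
```

If the exit-code assumption does not hold for a given build, the same loop can capture stdout instead and look for `[FAIL]` lines.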