0.3.28 zgemm/cgemm test failures on aarch64
I am trying to update openblas 0.3.26->0.3.28 on Fedora/CentOS. On Fedora I have some test failures on aarch64. On CentOS all tests passed. Both use Neoverse-N1, but the failing Fedora machine has 12 cores, while the passing CentOS machine has 32 cores.
I saw in the 0.3.28 patch notes that there were some aarch64-specific gemm changes, so it could be related to those.
hw_info_fedora_fail.log
build.log (too big to attach here directly)
TEST 1439/1526 zgemm:transa_conjnotransb [FAIL]
ERR: test_extensions/test_zgemm.c:272 expected 0.000e+00, got 7.890e+00 (diff -7.890e+00, tol 1.000e-13)
TEST 1440/1526 zgemm:conjnotransa_transb [FAIL]
ERR: test_extensions/test_zgemm.c:250 expected 0.000e+00, got 7.735e+00 (diff -7.735e+00, tol 1.000e-13)
TEST 1441/1526 zgemm:conjnotransa_conjnotransb [FAIL]
ERR: test_extensions/test_zgemm.c:228 expected 0.000e+00, got 8.489e+00 (diff -8.489e+00, tol 1.000e-13)
TEST 1442/1526 zgemm:conjnotransa_notransb [FAIL]
ERR: test_extensions/test_zgemm.c:206 expected 0.000e+00, got 7.586e+00 (diff -7.586e+00, tol 1.000e-13)
TEST 1443/1526 zgemm:conjnotransa_conjtransb [FAIL]
ERR: test_extensions/test_zgemm.c:184 expected 0.000e+00, got 8.016e+00 (diff -8.016e+00, tol 1.000e-13)
TEST 1444/1526 zgemm:notransa_conjnotransb [FAIL]
ERR: test_extensions/test_zgemm.c:162 expected 0.000e+00, got 7.245e+00 (diff -7.245e+00, tol 1.000e-13)
TEST 1445/1526 zgemm:conjtransa_conjnotransb [FAIL]
ERR: test_extensions/test_zgemm.c:140 expected 0.000e+00, got 8.251e+00 (diff -8.251e+00, tol 1.000e-13)
TEST 1446/1526 cgemm:transa_conjnotransb [FAIL]
ERR: test_extensions/test_cgemm.c:272 expected 0.000e+00, got 7.783e+00 (diff -7.783e+00, tol 1.000e-04)
TEST 1447/1526 cgemm:conjnotransa_transb [FAIL]
ERR: test_extensions/test_cgemm.c:250 expected 0.000e+00, got 7.625e+00 (diff -7.625e+00, tol 1.000e-04)
TEST 1448/1526 cgemm:conjnotransa_conjnotransb [FAIL]
ERR: test_extensions/test_cgemm.c:228 expected 0.000e+00, got 7.754e+00 (diff -7.754e+00, tol 1.000e-04)
TEST 1449/1526 cgemm:conjnotransa_notransb [FAIL]
ERR: test_extensions/test_cgemm.c:206 expected 0.000e+00, got 8.025e+00 (diff -8.025e+00, tol 1.000e-04)
TEST 1450/1526 cgemm:conjnotransa_conjtransb [FAIL]
ERR: test_extensions/test_cgemm.c:184 expected 0.000e+00, got 8.346e+00 (diff -8.346e+00, tol 1.000e-04)
TEST 1451/1526 cgemm:notransa_conjnotransb [FAIL]
ERR: test_extensions/test_cgemm.c:162 expected 0.000e+00, got 8.060e+00 (diff -8.060e+00, tol 1.000e-04)
TEST 1452/1526 cgemm:conjtransa_conjnotransb [FAIL]
ERR: test_extensions/test_cgemm.c:140 expected 0.000e+00, got 8.272e+00 (diff -8.272e+00, tol 1.000e-04)
That is a bit suspicious; I don't think we had any recent changes in the complex GEMM kernels for the N1. Are you using the same compiler version on both, and can you get the CentOS build to fail by running openblas_utest_ext with OPENBLAS_NUM_THREADS=12?
@martin-frbg Yes, I got it to fail the same way on c10s by limiting OpenBLAS to 12 threads. The gcc version is the same afaik.
Build log link, in case you are interested: https://drive.google.com/file/d/1h5xDycLQUESGJjfPG0ewO8MH746fRqw_/view?usp=sharing
Reproduced on the new Ampere Altra in the GCC Compile Farm. Very odd: it only fails at 12, 13 and 24 threads. I'm more inclined to suspect an oddity in the test code itself...
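
For anyone who wants to repeat the thread-count sweep, here is a minimal sketch in Python. It only sets OPENBLAS_NUM_THREADS (the variable mentioned above) and reruns a command once per thread count. The helper name `failing_thread_counts` is made up for illustration, and the sketch assumes the test binary exits with a nonzero status when any test fails.

```python
import os
import subprocess


def failing_thread_counts(cmd, counts):
    """Run cmd once per thread count and return the counts whose run failed.

    cmd is an argument list, e.g. the path to openblas_utest_ext.
    A run counts as failed when the process exits with a nonzero status
    (an assumption about how the test binary reports failures).
    """
    failed = []
    for n in counts:
        # Copy the current environment and override only the thread limit.
        env = dict(os.environ, OPENBLAS_NUM_THREADS=str(n))
        result = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failed.append(n)
    return failed
```

Calling `failing_thread_counts(["./utest/openblas_utest_ext"], range(1, 33))` from the build directory would list the thread counts that fail. The binary path is an assumption and may differ in your build tree.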