mpimd-csc/qrupdate-ng

need help: test fails on darwin x86-64 (but curiously not aarch64)

donn opened this issue · 4 comments

donn commented

The test test_tch1dn/zch1dn fails on x86-64 versions of macOS, but not aarch64.

The residual errors are worse overall on the former, but only for zch1dn is the error large enough to tip over into a failure.

Is this a serious problem? Is there a way for me to adjust the threshold?

x86-64 test log
Output:
----------------------------------------------------------

 testing Cholesky rank-1 downdate routines.
 All residual errors are expected to be small.

 sch1dn test:
      residual error =              0.572204589844E-05       PASS
 dch1dn test:
      residual error =              0.888178419700E-14       PASS
 cch1dn test:
      residual error =              0.953972266871E-05       PASS
 zch1dn test:
      residual error =              0.497379915032E-13       FAIL
----------------------------------------------------------------------
 total:     PASSED   3     FAILED   1

aarch64 test log
Output:
----------------------------------------------------------

 testing Cholesky rank-1 downdate routines.
 All residual errors are expected to be small.

 sch1dn test:
      residual error =              0.476837158203E-05       PASS
 dch1dn test:
      residual error =              0.106581410364E-13       PASS
 cch1dn test:
      residual error =              0.152587890625E-04       PASS
 zch1dn test:
      residual error =              0.284217094304E-13       PASS
----------------------------------------------------------------------
 total:     PASSED   4     FAILED   0

Can you give some more details about the BLAS library you used? There must be a reason for the difference. Nevertheless, it seems safe to adjust the tolerance a bit.
Just change the factor 2D2 in

if (rnrm < 2d2*dlamch('p')) then

to 1D3.
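
For reference, the effect of that factor can be checked with a little arithmetic. The following is only a sketch in Python, not part of the test suite. It assumes that the quoted check is what decides the zch1dn result and that, for IEEE 754 double precision, dlamch('p') equals 2**-52 (the value Python exposes as sys.float_info.epsilon). The names PRECISION, RESIDUALS and passes are made up for illustration; the residuals are copied from the two logs above.

```python
import sys

# Assumption: for IEEE 754 doubles, LAPACK's dlamch('p') (eps * base)
# equals 2**-52, the same value as sys.float_info.epsilon.
PRECISION = sys.float_info.epsilon

# zch1dn residuals copied from the two test logs in this issue.
RESIDUALS = {
    "x86-64": 0.497379915032e-13,
    "aarch64": 0.284217094304e-13,
}


def passes(residual, factor):
    """Mirror the quoted check: rnrm < factor * dlamch('p')."""
    return residual < factor * PRECISION


if __name__ == "__main__":
    # Compare the current factor (2d2) with the suggested one (1d3).
    for factor in (2e2, 1e3):
        threshold = factor * PRECISION
        for platform, residual in RESIDUALS.items():
            verdict = "PASS" if passes(residual, factor) else "FAIL"
            print(f"factor={factor:g} threshold={threshold:.3e} "
                  f"{platform}: {verdict}")
```

Under these assumptions the current threshold is roughly 4.44E-14, which sits between the two reported zch1dn residuals. That matches the observed FAIL on x86-64 and PASS on aarch64. With a factor of 1D3 the threshold is roughly 2.22E-13, so both reported residuals would pass.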

Hey, on Nixpkgs (which also distributes software for both Darwin platforms) we are seeing exactly the same issue on x86_64-darwin (and not on aarch64-darwin):

 1/13 Test  #1: test_tch1dn ......................***Failed   16.07 sec
 testing Cholesky rank-1 downdate routines.
 All residual errors are expected to be small.
 sch1dn test:
      residual error =              0.572204589844E-05       PASS
 dch1dn test:
      residual error =              0.106581410364E-13       PASS
 cch1dn test:
      residual error =              0.953972266871E-05       PASS
 zch1dn test:
      residual error =              0.497379915032E-13       FAIL
----------------------------------------------------------------------
 total:     PASSED   3     FAILED   1
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
STOP 1

Unfortunately, I don't have the floating-point results from aarch64-darwin available (because the tests passed there). Our BLAS and LAPACK implementations are both based on OpenBLAS version 0.3.27. The build log for it is available here (for x86_64-darwin):

https://cache.nixos.org/log/vdk8dns4jvy1n7w1djhdy7i1a3ph37p0-openblas-0.3.27.drv

I don't personally have an x86_64-darwin machine, so I can only use our CI, which is unfortunately very slow for these platforms, and I won't be able to help much with debugging. I hope the information I provided helps a bit.

donn commented

It's the same BLAS version, FWIW. I am using Nix to build qrupdate.

It seems that the tolerances need to be adjusted a bit more. I'll prepare a patch in the next few days. The solution seems to be to adjust this line again:

if (rnrm < 2d2*dlamch('p')) then