sandialabs/LCM

ACE MiniErosion test with denudation failing on Proxima/Algol post Fedora 35 update

ikalash opened this issue · 5 comments

The ACE MiniErosion test with denudation began failing in Clang builds after Proxima/Algol were upgraded to Fedora 35. The test hits a floating-point exception (FPE):

https://sems-cdash-son.sandia.gov/cdash/test/1879062

-- Nonlinear Solver Step 0 -- 
||F|| = 2.557e+03  step = 0.000e+00  dx = 0.000e+00
************************************************************************

[algol:2823662:0:2823662] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid:2823662) ====
 0  /usr/lib64/libucs.so.0(ucs_handle_error+0x2a4) [0x7fe851d05864]
 1  /usr/lib64/libucs.so.0(+0x2a41d) [0x7fe851d0941d]
 2  /usr/lib64/libucs.so.0(+0x2a6fa) [0x7fe851d096fa]
 3  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libifpack2.so.13(_ZN7Ifpack27Details9ChebyshevIdN6Tpetra11MultiVectorIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS4_6SerialENS4_9HostSpaceEEEEEE7computeEv+0x41e) [0x7fe85a3354ae]
 4  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libifpack2.so.13(_ZN7Ifpack29ChebyshevIN6Tpetra9RowMatrixIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS3_6SerialENS3_9HostSpaceEEEEEE7computeEv+0x173) [0x7fe85a231843]
 5  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libmuelu.so.13(_ZN5MueLu15Ifpack2SmootherIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE14SetupChebyshevERNS_5LevelE+0xa52) [0x7fe85bc98aa2]
 6  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libmuelu.so.13(_ZN5MueLu15Ifpack2SmootherIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE5SetupERNS_5LevelE+0xcee) [0x7fe85bc92e8e]
 7  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libmuelu.so.13(_ZN5MueLu16TrilinosSmootherIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE5SetupERNS_5LevelE+0xd5) [0x7fe85bf888d5]
 8  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libmuelu.so.13(_ZNK5MueLu15SmootherFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE13BuildSmootherERNS_5LevelENS_9PreOrPostE+0x6e3) [0x7fe85bf05693]
 9  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libmuelu-adapters.so.13(_ZNK5MueLu22SingleLevelFactoryBase9CallBuildERNS_5LevelE+0x38c) [0x7fe85c8f491c]
10  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libmuelu-adapters.so.13(_ZN5MueLu5Level3GetIN7Teuchos3RCPINS_12SmootherBaseIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS5_6SerialENS5_9HostSpaceEEEEEEEEERT_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPKNS_11FactoryBaseE+0x13f) [0x7fe85c8eea5f]
11  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libmuelu.so.13(_ZNK5MueLu18TopSmootherFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE5BuildERNS_5LevelE+0x185) [0x7fe85bf80c15]
12  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libmuelu.so.13(_ZN5MueLu9HierarchyIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE5SetupEiN7Teuchos3RCPIKNS_18FactoryManagerBaseEEESC_SC_+0x2877) [0x7fe85bc4ef87]
13  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libmuelu-interface.so.13(_ZNK5MueLu16HierarchyManagerIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE14SetupHierarchyERNS_9HierarchyIdixS6_EE+0x16bc) [0x7fe85c4f76bc]
14  /home/lcm/LCM/lcm-build-serial-clang-release/src/libalbanyLib.so(_ZN5MueLu26CreateXpetraPreconditionerIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEEEN7Teuchos3RCPINS_9HierarchyIT_T0_T1_T2_EEEENS8_IN6Xpetra6MatrixISA_SB_SC_SD_EEEERKNS7_13ParameterListE+0xd46) [0x7fe860a11956]
15  /home/lcm/LCM/lcm-build-serial-clang-release/src/libalbanyLib.so(_ZNK5Thyra26MueLuPreconditionerFactoryIdixN6Kokkos6Compat23KokkosDeviceWrapperNodeINS1_6SerialENS1_9HostSpaceEEEE14initializePrecERKN7Teuchos3RCPIKNS_18LinearOpSourceBaseIdEEEEPNS_18PreconditionerBaseIdEENS_16ESupportSolveUseE+0x43c6) [0x7fe860a08bc6]
16  /home/lcm/LCM/trilinos-install-serial-clang-release/lib/libnox.so.13(_ZNK3NOX5Thyra5Group10updateLOWSEv+0x4e4) [0x7fe85b016b44]

Curiously, the problem does not show up in the gcc build.

@lxmota: would you be able to provide instructions for how I can recreate the nightly clang build on algol, so I can try to debug the problem in the serial run?

Yes, I recommend working in the nightly test directory to minimize the effort:

cd /home/lcm/LCM
source ./env-all.sh
cd "$LCM_DIR"
module purge
module load serial-clang-release
./clean-config-build.sh trilinos 72 -V && ./clean-config-build.sh lcm 72 -V

If you want a debug build, just change the module load line to:

module load serial-clang-debug
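
Since the release and debug variants differ only in the module name, the steps can be wrapped in a small helper. This is only a sketch: `run`, `rebuild_nightly`, and `DRY_RUN` are hypothetical names introduced here, and the paths, module names, and build script are simply the ones quoted above. By default it only prints the commands; set `DRY_RUN=0` on algol to actually execute them.

```shell
#!/usr/bin/env bash
# Sketch only: wraps the nightly rebuild steps quoted in this thread.
# run, rebuild_nightly and DRY_RUN are hypothetical names, not part of LCM.

# Print the command in dry-run mode (the default); otherwise execute it
# in the current shell so that cd/source/module affect later steps.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$1"
  else
    eval "$1"
  fi
}

# Usage: rebuild_nightly [module] [jobs]
#   module: serial-clang-release (default) or serial-clang-debug
#   jobs:   second argument passed to clean-config-build.sh (default 72)
rebuild_nightly() {
  build_module="${1:-serial-clang-release}"
  jobs="${2:-72}"
  # Unlike the commands typed by hand, the chain stops at the first failure.
  run 'cd /home/lcm/LCM' &&
  run 'source ./env-all.sh' &&
  run 'cd "$LCM_DIR"' &&
  run 'module purge' &&
  run "module load ${build_module}" &&
  run "./clean-config-build.sh trilinos ${jobs} -V" &&
  run "./clean-config-build.sh lcm ${jobs} -V"
}

# Dry run of the debug variant: prints the seven commands without running them.
rebuild_nightly serial-clang-debug
```

Running with `DRY_RUN=0 rebuild_nightly serial-clang-release` would perform the release rebuild instead, assuming the nightly directory and modules are available to your account.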

Curiously, the issue does not show up in a debug build, so it is unclear how to proceed. Perhaps one could try switching to an Ifpack2 preconditioner, since the FPE occurs during MueLu setup. Unfortunately, I can't try this myself, since I don't have permissions for the /home/lcm clang release build from the nightlies.

Ok, so it looks like the initial nonlinear solves fail for a while (maybe the time step is too big initially), and for some reason, MueLu doesn't like this and barfs with clang. You can see the failed initial nonlinear solves in the gcc build, which does run: https://sems-cdash-son.sandia.gov/cdash/test/1878483 (||F|| stagnates). Switching to Ifpack2 circumvents the problem, and the test passes. I'll check in the fix now and close this tomorrow if the test runs clean.

@ikalash Loads of fun! Thanks for digging into this!

No problem! This was way easier than issue #51!