sandialabs/LCM

LCM test failures on some platforms due to exodiff

ikalash opened this issue · 8 comments

A few LCM tests have been failing on some of the testing platforms for months now due to the exodiff comparison at the end of the run "failing". The culprits are:

CrystalPlasticity_MinisolverStep_Newton - fails on Mayer, AlbanyIntel (CEE)
CrystalPlasticity_MinisolverStep_NewtonLineSearch - fails on Mayer, AlbanyIntel (CEE)
CrystalPlasticity_MinisolverStep_TrustRegion - fails on Mayer, AlbanyIntel (CEE)
CrystalPlasticity_ThermallyActivatedSlip - fails on Mayer, Blake (Serial and OpenMP build)
HeliumODEs_HeBubblesDecay - fails on Blake (Serial and OpenMP build)

I would like to see the dashboard clean, as having tests fail consistently for months makes it harder to spot new failures and monitor code health. Can one of the LCM folks (@lxmota, @calleman21, @jwfoulk) please have a look at the output from the tests to determine if the failures are trivial or not? I think some of the failures may be due to a very small tolerance being passed to exodiff, which may not be quite satisfied on all platforms. The following are ways we could move forward:

1.) Increase the exodiff tolerance so that the tests pass everywhere (if it is determined that the results on the platforms where the tests fail are OK).
2.) Create a machine-specific exodiff file for the problematic platforms (again, if it is determined that the results on the platforms where the tests fail are OK)
3.) Add a configure flag to turn off the problematic tests, and activate it on the problematic platforms.

3.) is not the best option but I will do this if there is no clear path towards getting the tests passing.
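
For anyone weighing option 1.), here is a minimal sketch of why a very tight relative tolerance can pass on one platform and fail on another. It is not exodiff's source; `within_relative_tolerance` is a hypothetical helper that only mimics the general idea of a relative difference test with a floor, under which values too small to matter are skipped.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper (an assumption for illustration, not exodiff code):
// relative difference test with a floor. If both magnitudes are below
// `floor`, the pair is treated as equal and not compared further.
inline bool within_relative_tolerance(double a, double b, double tol, double floor)
{
  assert(tol >= 0.0 && floor >= 0.0);
  if (std::abs(a) < floor && std::abs(b) < floor) return true;
  // Scale the allowed difference by the larger of the two magnitudes.
  double const scale = std::max(std::abs(a), std::abs(b));
  return std::abs(a - b) <= tol * scale;
}
```

Two results that differ only by platform-level roundoff (say around 1e-9 relative) pass with a tolerance of 1e-6 but fail with 1e-12, and near-zero values of opposite sign fail any relative test unless a floor is set.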

One can see the output from the failures here:

@etphipp, @lxmota : I am going to move the FPE discussion to this issue, since it makes more sense.

I changed the clang build to a debug one and strangely enough the FPEs for the particular problem I was looking at (HeliumODEs) went away. I'm somewhat stumped by why the FPE would not show up in a debug build...

I'm not sure what is best to have in the way of nightly testing: the Clang + FPE check non-debug build, or the Clang + FPE check debug build. It seems the latter will have fewer failures with LCM enabled, strangely enough.

You don't necessarily have to do a debug build to get the stack trace. You can do an optimized build and just add "-g" to your cxx_flags.

@etphipp : yes, I know. My comment was a more general one - it's weird that an all-debug build would not have the same FPE behavior as an optimized one.

Yeah I do agree that is strange.

@etphipp : here is the full backtrace with line numbers from the problem we were talking about earlier:

************************************************************************
-- Nonlinear Solver Step 0 -- 
||F|| = 0.000e+00  step = 0.000e+00  dx = 0.000e+00
************************************************************************

 Phalanx writing graphviz file for graph of Jacobian fill (detail =1)
 Process using 'dot -Tpng -O phalanx_graph' 
 

Program received signal SIGFPE, Arithmetic exception.
0x0000000003feaad7 in Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >::operator=<Sacado::Fad::MultiplicationOp<Sacado::Fad::Expr<Sacado::Fad::MultiplicationOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::SubtractionOp<Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::DivisionOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> > > ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Sacado_Fad_GeneralFad.hpp:279
279	            SACADO_FAD_DERIV_LOOP(i,sz)
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 glibc-2.12-1.209.el6_9.1.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-23.el6.x86_64 libcurl-7.19.7-53.el6_9.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libssh2-1.4.2-2.el6_7.1.x86_64 nspr-4.13.1-1.el6.x86_64 nss-3.28.4-1.el6_9.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.28.4-1.el6_9.x86_64 numactl-2.0.9-2.el6.x86_64 openldap-2.4.40-16.el6.x86_64 openssl-1.0.1e-57.el6.x86_64
(gdb) bt 
#0  0x0000000003feaad7 in Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >::operator=<Sacado::Fad::MultiplicationOp<Sacado::Fad::Expr<Sacado::Fad::MultiplicationOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::SubtractionOp<Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::DivisionOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> > > ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Sacado_Fad_GeneralFad.hpp:279
#1  0x0000000004294466 in LCM::J2Model<PHAL::AlbanyTraits::Jacobian, PHAL::AlbanyTraits>::computeState(PHAL::Workset&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Teuchos::RCP<PHX::MDField<Sacado::Fad::DFad<double> const, void, void, void, void, void, void, void, void> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, Teuchos::RCP<PHX::MDField<Sacado::Fad::DFad<double> const, void, void, void, void, void, void, void, void> > > > >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, Teuchos::RCP<PHX::MDField<Sacado::Fad::DFad<double>, void, void, void, void, void, void, void, void> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, Teuchos::RCP<PHX::MDField<Sacado::Fad::DFad<double>, void, void, void, void, void, void, void, void> > > > >) ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Sacado_Fad_DFad_tmpl.hpp:158
#2  0x0000000003f1de64 in LCM::ConstitutiveModelInterface<PHAL::AlbanyTraits::Jacobian, PHAL::AlbanyTraits>::evaluateFields(PHAL::Workset&) () at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Albany/src/./LCM/models/ConstitutiveModelInterface_Def.hpp:240
#3  0x0000000001c3e815 in PHX::DagManager<PHAL::AlbanyTraits>::evaluateFields(PHAL::Workset&) ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Phalanx_DAG_Manager_Def.hpp:450
#4  0x0000000001c1ecc5 in void PHX::FieldManager<PHAL::AlbanyTraits>::evaluateFields<PHAL::AlbanyTraits::Jacobian>(PHAL::Workset&) ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Phalanx_FieldManager_Def.hpp:321
#5  0x0000000001c03ba8 in Albany::Application::computeGlobalJacobianImpl(double, double, double, double, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::Array<Sacado::ScalarParameterVector<SPL_Traits> > const&, Teuchos::RCP<Thyra::VectorBase<double> > const&, Teuchos::RCP<Thyra::LinearOpBase<double> > const&, double) () at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Albany/src/Albany_Application.cpp:1677
#6  0x0000000001c08caf in Albany::Application::computeGlobalJacobian(double, double, double, double, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::Array<Sacado::ScalarParameterVector<SPL_Traits> > const&, Teuchos::RCP<Thyra::VectorBase<double> > const&, Teuchos::RCP<Thyra::LinearOpBase<double> > const&, double) () at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Albany/src/Albany_Application.cpp:1870
#7  0x0000000001c79afa in Albany::ModelEvaluatorT::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Albany/src/Albany_ModelEvaluatorT.cpp:895
#8  0x0000000001b614fa in Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Thyra_ModelEvaluatorDefaultBase.hpp:691
#9  0x0000000001bd8177 in Thyra::DefaultModelEvaluatorWithSolveFactory<double>::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Thyra_DefaultModelEvaluatorWithSolveFactory.hpp:325
#10 0x0000000001b614fa in Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Thyra_ModelEvaluatorDefaultBase.hpp:691
#11 0x0000000005b8c6df in LOCA::Thyra::Group::computeJacobian() ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/packages/nox/src-loca/src-thyra/LOCA_Thyra_Group.C:180
#12 0x0000000005c100f8 in LOCA::MultiContinuation::ConstrainedGroup::computeJacobian() ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:357
#13 0x0000000005e476ad in NOX::Direction::Newton::compute(NOX::Abstract::Vector&, NOX::Abstract::Group&, NOX::Solver::Generic const&)
    () at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/packages/nox/src/NOX_Direction_Newton.C:131
#14 0x0000000005dc38fa in NOX::Solver::LineSearchBased::step() ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:194
#15 0x0000000005dc3c39 in NOX::Solver::LineSearchBased::solve() ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:260
#16 0x0000000005c31b2f in LOCA::Stepper::start() ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:372
#17 0x0000000005bfc21a in LOCA::Abstract::Iterator::run() ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:122
#18 0x0000000004d3055f in Piro::LOCASolver<double>::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#19 0x0000000001b614fa in Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Thyra_ModelEvaluatorDefaultBase.hpp:691
#20 0x0000000001a5d749 in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::Array<bool> const&, bool, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::RCP<Piro::SolutionObserverBase<double, Thyra::VectorBase<double> const> >) ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Piro_PerformSolve_Def.hpp:120
#21 0x0000000001a5cae6 in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::ParameterList&, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&) ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Piro_PerformSolve_Def.hpp:162
#22 0x0000000001a4aa75 in main ()
    at /ascldap/users/ikalash/glensBuilds/nightlies/repos/Trilinos/build-clang-mg/install/include/Piro_PerformSolve_Def.hpp:291

As I pointed out earlier, gdb doesn't identify the line that is problematic in Albany - it is just a generic ConstitutiveModelInterface routine. The complexity of the material model code in Albany makes it difficult/tedious to resolve all possible FPEs (at least for me).

Thanks @ikalash! Unfortunately, because of inlining, it isn't showing the exact line in LCM. But we can deduce it is in LCM::J2Model::computeState() here. By mentally parsing the expression, it looks like it is "Fad = const * Fad * (Fad - const/Fad)", which appears to match only line 298 near the end:

      // compute pressure
      p = 0.5 * kappa * (J(cell, pt) - 1. / (J(cell, pt)));

So it must be the case that J(cell,pt) == 0. There is also a similar division by J on the next line:

      // compute stress
      sigma = p * I + s / J(cell, pt);
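
The division-by-zero diagnosis can be reproduced outside Albany with plain doubles. The sketch below is an illustration only: `pressure_raises_divbyzero` is a hypothetical name, and it evaluates the same pressure formula and then queries the standard floating-point status flags with `std::fetestexcept` instead of trapping. In a build with FPE trapping enabled, the same operation would presumably be delivered as SIGFPE rather than just setting a flag.

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper (not Albany code): evaluates
//   p = 0.5 * kappa * (J - 1 / J)
// with doubles and reports whether the divide-by-zero flag was raised.
inline bool pressure_raises_divbyzero(double kappa, double J_value)
{
  // volatile keeps the compiler from folding the division at compile time.
  volatile double J = J_value;
  std::feclearexcept(FE_ALL_EXCEPT);
  volatile double p = 0.5 * kappa * (J - 1.0 / J);
  (void)p;
  return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

Calling it with `J_value` equal to 0.0 reports the flag, while any nonzero value does not. This only covers the value computation; with a Fad type the derivative components also divide by J, which is consistent with the trap landing in the Sacado derivative loop.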

How you want to address this is up to you, but many people have done something like this in the past:

      // compute pressure
      p = J(cell,pt) == 0.0 ? ScalarT(0.0) : ScalarT(0.5 * kappa * (J(cell, pt) - 1. / (J(cell, pt))));

      // compute stress
      sigma = J(cell,pt) == 0.0 ? ScalarT(0.0) : ScalarT(p * I + s / J(cell, pt));

The casts to ScalarT are necessary because of restrictions on the types returned by the ternary operator.
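
As a self-contained illustration of the guard pattern, here is a toy version of the pressure line written as a function template. `guarded_pressure` is a hypothetical name and is not part of J2Model; it is only exercised here with `ScalarT = double`, where the casts are redundant. With an expression-template type such as a Sacado Fad, the right-hand branch would be an expression object rather than a `ScalarT`, which is why both branches are converted explicitly.

```cpp
#include <cassert>

// Toy stand-in for the guarded pressure computation (an assumption for
// illustration). ScalarT is any type supporting ==, *, -, / with doubles.
template <typename ScalarT>
ScalarT guarded_pressure(ScalarT const& kappa, ScalarT const& J)
{
  // Both branches are converted to ScalarT so the conditional expression
  // has a single, unambiguous type.
  return J == 0.0 ? ScalarT(0.0) : ScalarT(0.5 * kappa * (J - 1.0 / J));
}
```

For example, `guarded_pressure<double>(2.0, 2.0)` evaluates 0.5 * 2 * (2 - 0.5), and `guarded_pressure<double>(2.0, 0.0)` takes the guarded branch and returns zero without dividing.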

@etphipp @ikalash I can introduce these modifications, but J being zero is a symptom of something going very wrong.

@etphipp : thanks for looking at it. I agree that those lines can be problematic. Unfortunately, if I make the fix you suggest, I still get FPEs (I had gone through this exercise before, which led me to punt on the issue). There are a lot of divisions in the J2Model file, so it would be best for someone like @lxmota, who is more acquainted with the material models than I am, to look at this. Alejandro, thanks for offering to do so. I am attaching the configure scripts to reproduce the issue on CEE. I build/run on cee-compute016 (the modules aren't available on all compute nodes).
modules_clang.sh.txt
do-cmake-cmake-albany-cee-clang.txt
do-cmake-trilinos-cee-clang.txt