Unexpected Error with Increasing MPI ranks related to search alg.
nkzk-stan opened this issue · 2 comments
I am facing unexpected behavior from Nalu. In short, I am rotating a square with a sliding mesh interface at a reasonably low omega (1.5) relative to other cases I have run for a circle and an ellipse.
When I deploy on 1 node and 16 CPUs, the case runs without a problem.
When I deploy on 1 node and 32 CPUs, it fails with error 159:
_Throw number = 159
Throw test that evaluated to true: !std::isfinite(Teuchos::ScalarTraits::magnitude(omega))
Prolongator damping factor needs to be finite.
MueLu::Exceptions::RuntimeError'
what(): /shared/nalu/build/packages/Trilinos/packages/muelu/src/Transfers/Smoothed-Aggregation/MueLu_SaPFactory_def.hpp:228:_
I resubmitted the job with 32 CPUs, and this time it failed with error 160:
_Throw number = 160
Throw test that evaluated to true: true
Belos::StatusTestImpResNorm::checkStatus(): One or more of the current implicit residual norms is NaN.
Belos::StatusTestError'
what(): /shared/nalu/build/packages/Trilinos/packages/belos/src/BelosStatusTestImpResNorm.hpp:635:_
This issue was corrected by setting the following in the input file:
search_tolerance: 0.05
activate_dynamic_search_algorithm: no
I have attached the input file (I had to rename it to .pdf so it could be attached, so just remove the .pdf extension to access it).
In addition, for another simulation of a square at a slower omega (0.707), Nalu freezes at the same timestep. This occurred on both 16 and 32 CPUs. This case uses the same input file as the one above, with only the omega and timestep changed. This issue was also corrected by the above fix. This is the last output before Nalu stalls:
Time Step Count: 1075 Current Time: 15.5672
dtN: 0.0149393 dtNm1: 0.0149967 gammas: 1.49904 -1.99617 0.497129
Volume 796 min: 0.000463178 max: 0.00877652
NonConformal alg will ghost a new number of entities: 14 and remove 84 entities from ghosting.
DgInfo size overview for name: Current_surface_5__Opposing_surface_55
When I run this case on this resource,
Currently Loaded Modules:
1) tbb/2021.1.1 3) compiler/2021.1.1 5) impi/2018
2) compiler-rt/2021.1.1 4) intel/2021.1.1 6) gnu8/8.3.0
I also see:
Quad42DSCS::general_face_grad_op: issue..
This was the clue that prompted the suggestion to remove the dynamic search algorithm since the issue rests in serving up a poor opposing element from which the face grad op is required.
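To make the coarse/fine distinction concrete, here is a minimal 2D sketch, not Nalu's actual search code. The `Point`, `Box`, `coarseSearch`, and `fineSearch` names are hypothetical, and the padded bounding-box test is only an assumed stand-in for how a search tolerance might widen the candidate set. It shows that the fine stage can only choose among what the coarse stage hands it.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical types for illustration only; these are not Nalu data structures.
struct Point { double x, y; };
struct Box { double xmin, ymin, xmax, ymax; };

// Sentinel returned when the fine search has no candidates to choose from.
const std::size_t kNoCandidate = std::numeric_limits<std::size_t>::max();

// Coarse search: keep every element whose bounding box, padded by `tol`,
// contains the query point.
std::vector<std::size_t> coarseSearch(const std::vector<Box>& boxes,
                                      const Point& p, double tol) {
  std::vector<std::size_t> hits;
  for (std::size_t i = 0; i < boxes.size(); ++i) {
    const Box& b = boxes[i];
    if (p.x >= b.xmin - tol && p.x <= b.xmax + tol &&
        p.y >= b.ymin - tol && p.y <= b.ymax + tol) {
      hits.push_back(i);
    }
  }
  return hits;
}

// Distance from a point to a box (zero if the point lies inside it).
double distanceToBox(const Box& b, const Point& p) {
  const double dx = std::max({b.xmin - p.x, 0.0, p.x - b.xmax});
  const double dy = std::max({b.ymin - p.y, 0.0, p.y - b.ymax});
  return std::hypot(dx, dy);
}

// Fine search: pick the closest box among the candidates that were served up.
std::size_t fineSearch(const std::vector<Box>& boxes,
                       const std::vector<std::size_t>& candidates,
                       const Point& p) {
  std::size_t best = kNoCandidate;
  double bestDist = std::numeric_limits<double>::infinity();
  for (std::size_t i : candidates) {
    const double d = distanceToBox(boxes[i], p);
    if (d < bestDist) {
      bestDist = d;
      best = i;
    }
  }
  return best;
}
```

With two boxes separated by a small gap and a point inside that gap, a generous tolerance yields both candidates and the fine search picks the nearer one. A tolerance that is too tight yields none, and a candidate list that is missing the nearer box forces the fine search to return the poorer element.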
Fixes are as follows:
- At the very least, we should throw in the element method before we allow a linear system to be assembled that ultimately produces a NaN.
- We need to revisit the dynamic tolerance parallel search algorithm. The MPI rank dependency is due to the coarse search not adequately serving up the full set of elements. There may be some corner case with mesh spacing/shape that drives this issue.
- Although the coarse parallel search is processor-count dependent, as long as we have the full set of proper candidates served up, the fine search should find the best candidate.
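The first fix in the list above could look roughly like the following. This is only a sketch under assumptions: `requireFinite` is a hypothetical helper, not an existing Nalu function, and the flat `std::vector<double>` stands in for whatever storage the element method actually uses for its gradient operator. The idea is to fail early, with a message naming the element routine, instead of letting a NaN surface later inside the linear solver.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical guard, not part of Nalu: throws if any entry is NaN or Inf.
// `where` names the calling routine so the failure points at its source.
void requireFinite(const std::vector<double>& values, const std::string& where) {
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (!std::isfinite(values[i])) {
      throw std::runtime_error(where + ": non-finite entry at index " +
                               std::to_string(i));
    }
  }
}
```

Called right after the face gradient operator is computed, a check like this would stop the run before assembly rather than at the MueLu or Belos exceptions shown earlier in this thread.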
Best,
Adding input file in non-pdf form.