LLNL/UEDGE

Segfault for icntnunk=1 when case is converged

holm10 opened this issue · 7 comments

holm10 commented

Describe the bug
When a case is converged (fnrm < ftol) and icntnunk is set to 1, the next exmain call causes a segfault.

To Reproduce

  1. Navigate to a case that is converged
  2. Ensure the case is well converged by executing exmain until fnrm<ftol
  3. Set bbb.icntnunk=1
  4. Call exmain
  5. Segfault occurs

For the Slab_geometry test case in the pytests, the following commands will reproduce the bug:
from rd_slabH_in_w_h5 import *;bbb.exmain();bbb.exmain();bbb.icntnunk=1;bbb.exmain()

Expected behavior
Execute exmain, print initial fnrm, set iterm=1, return to prompt

Additional context
Several users have encountered this at various points while using the code, most commonly in time-dependent runs where icntnunk is actively switched on and off. The bug does not appear to occur in the Basis versions.

holm10 commented

Bug traced to odesetup.m conditional:

if (icntnunk==1 .and. ijactot<=1 .and. svrpkg=='nksol') then

Segfault occurs due to xerrab call at:
call xerrab('**Error: need initial Jacobian-pair for icntnunk=1')

Using this information, the expected behavior can be achieved by executing the following commands in the Slab_geometry test directory:
bbb.exmain();bbb.exmain();bbb.icntnunk=1;bbb.ijactot=2;bbb.exmain()
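Rather than forcing ijactot by hand, a script can check the counter before requesting a continuation. The following is a minimal sketch, not part of UEDGE: the helper name continue_if_jacobian_available and its min_ijactot argument are hypothetical, the threshold of 2 is taken from the quoted conditional (ijactot<=1 fails), and the svrpkg=='nksol' part of the conditional is ignored. It only assumes an object with ijactot, icntnunk and exmain(), as bbb provides.

```python
def continue_if_jacobian_available(bbb, min_ijactot=2):
    """Call bbb.exmain() with icntnunk=1 only when the guard would pass.

    Hypothetical helper. `bbb` is any object exposing `ijactot`,
    `icntnunk` and `exmain()`. Returns True if the continuation
    (icntnunk=1) was used, False if it fell back to icntnunk=0.
    """
    if bbb.ijactot < min_ijactot:
        # The quoted conditional would call xerrab here, so avoid the
        # continuation and run a regular exmain instead.
        bbb.icntnunk = 0
        bbb.exmain()
        return False
    bbb.icntnunk = 1
    bbb.exmain()
    return True
```

With the real package this would be called as continue_if_jacobian_available(bbb) in place of setting bbb.icntnunk=1 directly. The fallback trades the segfault for a regular exmain call, which may be more expensive than a continuation.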

It appears the issue is with ijactot. Presumably, this check exists to ensure that a Jacobian exists before icntnunk=1 continues execution on the assumption that one is available. However, it is not clear to me why ijactot=1 is insufficient for the routine to proceed.

holm10 commented

On a general note, xerrab calls in the Python version seem to induce segfaults rather than the expected behavior of printing the error message and returning to the prompt. This should probably be fixed.
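Until that is fixed, one way to keep a segfault from taking down an interactive session is to run the suspect commands in a child interpreter. This is a sketch using only the Python standard library; the helper name run_isolated is hypothetical and nothing here is UEDGE API.

```python
import subprocess
import sys


def run_isolated(code, cwd=None, timeout=None):
    """Run a Python snippet in a child interpreter.

    Returns (returncode, stdout, stderr). On POSIX, a negative
    returncode -N means the child was killed by signal N, so a
    segfault shows up as -signal.SIGSEGV instead of crashing the caller.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        cwd=cwd,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout, result.stderr


# Example use with the reproduction from the report (run from the
# Slab_geometry test directory):
# rc, out, err = run_isolated(
#     "from rd_slabH_in_w_h5 import *;"
#     "bbb.exmain();bbb.exmain();bbb.icntnunk=1;bbb.exmain()"
# )
```

The child starts from a fresh interpreter, so the case setup has to be part of the snippet passed in.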

holm10 commented

The following line appears to explicitly prohibit ijactot from exceeding 1:

c -- Reinitialize ijactot if icntnunk = 0; prevents ijactot=2 by 2 exmain

However, after the first exmain call, ijactot=2. It appears ijactot is only advanced in jac_calc:
ijactot = ijactot + 1 #note: ijactot set 0 in exmain if icntnunk=0

During the first exmain call there are two Jacobian updates, which explains why ijactot=2. This means exmain can be executed with icntnunk=1 only after a call in which more than one Jacobian update was needed to reach convergence.
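The bookkeeping described above can be reproduced with a toy model. This is a sketch in plain Python, not UEDGE code: both function names are hypothetical, the guard is copied from the quoted conditional, and the reset rule follows the source comment "ijactot set 0 in exmain if icntnunk=0". How many Jacobian updates a converged exmain call performs is an assumption passed in as an argument.

```python
def guard_trips(icntnunk, ijactot, svrpkg="nksol"):
    """Mirror of the quoted conditional: True means xerrab would be called."""
    return icntnunk == 1 and ijactot <= 1 and svrpkg == "nksol"


def ijactot_after_exmain(icntnunk, ijactot, n_jac_updates):
    """Toy counter: reset when icntnunk == 0, then +1 per Jacobian update."""
    if icntnunk == 0:
        ijactot = 0
    return ijactot + n_jac_updates


# First exmain: two Jacobian updates, as observed in the thread.
after_first = ijactot_after_exmain(icntnunk=0, ijactot=0, n_jac_updates=2)

# Second exmain on the converged case: assumed to need at most one update.
after_second = ijactot_after_exmain(icntnunk=0, ijactot=after_first,
                                    n_jac_updates=1)

print(after_first, guard_trips(1, after_first))    # continuation allowed
print(after_second, guard_trips(1, after_second))  # xerrab path taken
```

Under these assumptions the counter after the second call is at most 1, so the icntnunk=1 call takes the xerrab path even though a Jacobian is available.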

holm10 commented

The suggested solution is to change the ijactot test in the icntnunk=1 conditional from ijactot<=1 to ijactot<1. I don't see why more than one Jacobian evaluation is necessary for exmain calls with icntnunk=1.

holm10 commented

One issue arises if the Jacobian calculation is aborted mid-execution: in that case ijactot=1 but the preconditioner is only partially evaluated. If the conditional is relaxed to ijactot<1, an exmain call with icntnunk=1 immediately after the abort then leads to errors. This is because ijactot is incremented at the beginning of subroutine jac_calc, rather than upon successful completion:

ijactot = ijactot + 1 #note: ijactot set 0 in exmain if icntnunk=0

Moving the above line to roughly here would solve this issue:
jcsc(neq+1) = nnz

However, failing the conditional and triggering an xerrab call will still segfault, which is a separate issue that needs to be resolved.
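The effect of moving the increment can be illustrated with a toy model. This is a sketch in plain Python rather than the actual jac_calc: every name below is hypothetical, the abort is simulated with an exception, and the guard is the relaxed form ijactot<1 proposed above.

```python
class JacobianAborted(Exception):
    """Stand-in for an interrupt during the Jacobian evaluation."""


def jac_calc_count_at_start(state, abort=False):
    state["ijactot"] += 1              # counted before any work is done
    if abort:
        raise JacobianAborted
    state["jacobian_complete"] = True


def jac_calc_count_on_completion(state, abort=False):
    if abort:
        raise JacobianAborted
    state["jacobian_complete"] = True
    state["ijactot"] += 1              # counted only after a full evaluation


def relaxed_guard_trips(state):
    """Relaxed conditional: True means the icntnunk=1 call is refused."""
    return state["ijactot"] < 1


def run(jac_calc, abort):
    """Run one toy Jacobian evaluation from a clean state."""
    state = {"ijactot": 0, "jacobian_complete": False}
    try:
        jac_calc(state, abort=abort)
    except JacobianAborted:
        pass
    return state


early = run(jac_calc_count_at_start, abort=True)
late = run(jac_calc_count_on_completion, abort=True)
print(early, relaxed_guard_trips(early))  # guard passes on a partial Jacobian
print(late, relaxed_guard_trips(late))    # guard refuses the continuation
```

With the increment at the start, an aborted evaluation still leaves ijactot=1 and the relaxed guard lets the continuation through. With the increment at the end, the counter stays at 0 and the continuation is refused.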

holm10 commented

Suggested fix implemented and tested. Closing as resolved.

This has the same underlying cause as #49. See that item for details.