PrincetonUniversity/athena

inter-block divergence issue when refining under certain circumstances, generic to all coordinates

bitwalkabout opened this issue · 39 comments

The issue involves large magnetic divergence measured in the first row/plane of ghost zones outside the active region at an inter-mesh boundary. This issue arises in a very specific circumstance: if left meshblock A is higher resolution than right meshblock B, and the refinement boundary moves to the right such that meshblock B is refined to match the resolution of A, the first ghost zones acquire a large divergence, capable of affecting the evolution. This seems to be true for all three dimensions in 3D, but only when the coarse meshblock side is refined, not when the fine meshblock is restricted.

I first came across this bug because of off-disk issues running the native Kerr-Schild coordinates with AMR turned on (forced to be axisymmetric). The recent divergence fixes related to curvilinear coordinates removed other instances of large divergence -- namely, inside the active domain -- but I later saw hydrodynamic discontinuities that looked like divergence-driven errors, so I checked inside the ghost zones and found the divergence there.

I hypothesized a cause that would be generic to flat-spacetime, Cartesian coordinates, so I turned my attention to the 3D Rayleigh-Taylor problem with AMR.

To replicate, consider commit 1591aab, running the RT with the following configuration:

configure.py -b --prob rt --flux hlle --nghost 4 -mpi -hdf5 -h5double

I change the refinement condition in rt.cpp to:

-  if (drmax > 1.5) return 1;
-  else if (drmax < 1.2) return -1;
+  if (drmax > 1.1) return 1;
+  else if (drmax < 1.05) return -1;

and, rather than looking at every meshblock output individually, I get a rough look at the ghost-zone divergence by calculating the divergence for every active cell shifted by -3 (shifting left into the ghost zones) and writing it to a user-defined output variable (see the sketch below). This is the only other change to rt.cpp.
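
For concreteness, the user-output addition looks roughly like this (a minimal sketch of the change to rt.cpp, not the exact code I ran; the -3 shift is applied only along x1 here, and the normalization by sqrt(bsq) and the curvilinear area/volume weighting are omitted for brevity):

void MeshBlock::InitUserMeshBlockData(ParameterInput *pin) {
  AllocateUserOutputVariables(1);  // one slot for the shifted divB
}

void MeshBlock::UserWorkBeforeOutput(ParameterInput *pin) {
  for (int k = ks; k <= ke; ++k) {
    for (int j = js; j <= je; ++j) {
      for (int i = is; i <= ie; ++i) {
        int ii = i - 3;  // shift left so the first ghost zones land in the plottable output
        user_out_var(0, k, j, i) =
            (pfield->b.x1f(k, j, ii+1) - pfield->b.x1f(k, j, ii))/pcoord->dx1f(ii)
          + (pfield->b.x2f(k, j+1, ii) - pfield->b.x2f(k, j, ii))/pcoord->dx2f(j)
          + (pfield->b.x3f(k+1, j, ii) - pfield->b.x3f(k, j, ii))/pcoord->dx3f(k);
      }
    }
  }
}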

I use athinput.rt3d with only a few changes:

- I output to HDF5
- xorder -> 3
- for speed, I reduced the resolution to nx1 = 36, nx2 = 6, nx3 = 36
- I turned on mesh refinement:

  refinement = adaptive
  numlevel = 3
  derefine_count = 5

  <meshblock>
  nx1 = 6
  nx2 = 6
  nx3 = 6

- I also changed:

  iprob = 3
  amp   = 0.5
  drat  = 1.4
  b0    = .0001
  angle = 30.

The Kerr-Schild issue appears as divergence lines (log10 of |divB|, normalized in the usual way by sqrt(bsq) and cell volumes/lengths):
[image: divergence map for the Kerr-Schild run]

A cut through the 6th (so really the 3rd, given the shift) active cell of the output, in the x-z plane, for the 3D RT problem shows similar behavior:
[image: x-z cut of the shifted divergence output, 3D RT run]
Though, I can't yet account for the alternating nature of the divergence in this case. Given the AMR tree structure, I'd guess it's related to that but I am not sure.

My hypothesis for what is happening is the following:

On the coarse side of meshblock B that I was describing before, prolongation happens at the faces of cells. Even in KS, with the updates from a few months back, the total magnetic flux summed/integrated across the coarse cell face, or across the four new fine faces, is the same. However, at the single face shared between A and B, the prolongation on B will produce four values that differ from the four already held on the fine faces of A. These values should be identical: the logic of constrained transport relies on the assumption that the two meshblocks (and their respective ranks) hold copies of the same face value, which is then changed identically through EMFs that are kept consistent by the flux-correction protocols.
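
In symbols (my notation): the divergence-preserving prolongation only guarantees

$$\sum_{n=1}^{4} B^{B}_{n}\,A_{n} \;=\; B^{\mathrm{coarse}}\,A^{\mathrm{coarse}},$$

where the $B^{B}_{n}$ are the four prolongated fine-face values on B and the $A_{n}$ are the fine-face areas, while consistent CT additionally needs, face by face,

$$B^{B}_{n} = B^{A}_{n}, \qquad n = 1,\dots,4,$$

with $B^{A}_{n}$ the fine-face values already held on A; the first condition does not imply the second.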

However, because refinement broke this assumption, divergence exists in the first ghost zones and the evolution will no longer maintain consistent CT. Because the prolongation is fairly faithful to the magnetic structure, the values are not absurdly off, so the hydrodynamic effect can be subtle -- unless (as I see in the accretion disk case) the refinement embeds a divergence associated with large fields in a high-density region, and the flow then moves so that the same divergence sits in a region of much lower true hydrodynamic pressure. I have a video of this effect on the hydrodynamic evolution if it's of interest -- it shows a discontinuity in density and other quantities at the location of the refinement boundary before it moved.

The fix would be simple -- use the copies of the four face B-field values that B can receive and, once, upon refinement, overwrite the four prolongated values. A few other face fluxes shared between those newly refined edge cells would probably also need to be adjusted to accommodate the overwritten values and maintain the constraint.

I tried to implement such a fix, but the communication of face magnetic field values, as it interweaves with the AMR code, was just too complex for me to quickly get a handle on. @tomidakn, perhaps you can point me to the changes necessary for a quick test of this idea? Or perhaps I'm misunderstanding some of the shared-face B-field update procedure.

It appears to me that the refinement happens at the end of the main loop, and I couldn't find an overwrite of the new face fields like this in the code (of course, maybe I missed it), which would support the idea that this is the cause of the issue.

I compile and run with Intel 20.0.4 and MVAPICH2. I can get specific versions if necessary.
Some flags I compiled with for the cases above (though the issue is generic to at least some other flag choices, such as removing -O3):

CXXFLAGS := -O3 -std=c++11 -fp-model=strict -qno-openmp-simd

Just an additional comment. Because the divergence isn't too bad in tests where the average flow properties don't trend downward much in time (which is most of them?), I think one might still get passing convergence tests, etc. One might not even notice if one didn't look at the divergence in the ghost zones themselves. Though, as I said, I still can't explain why it appears only in the ghost zones for the KS test but alternates between the first ghost zones and active zones in the Cartesian RT test.

I think I've convinced myself that after overwriting the new face-centered values with the block A values that were previously active, one then has to solve for and overwrite the newly prolongated face-centered fluxes on the block B faces shared among each set of four newly refined cells -- four variables, four constraint equations (see the sketch below). Does the function ProlongateSharedFieldX[1,2,3] have access to the shared inter-block face values from block A? I'm unsure which variable these would be stored in at this point in the code and how to access it from this function.
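
To spell out the counting (a sketch in my notation): writing $\Phi = B\,A$ for the signed flux through a face, each of the four newly refined cells $n$ adjacent to the shared face must satisfy

$$\Phi^{x_1,+}_{n} - \Phi^{x_1,-}_{n} + \Phi^{x_2,+}_{n} - \Phi^{x_2,-}_{n} + \Phi^{x_3,+}_{n} - \Phi^{x_3,-}_{n} = 0, \qquad n = 1,\dots,4 .$$

With the shared-face fluxes overwritten by block A's values and the outer faces left alone, these four equations can be used to solve for four of the remaining prolongated face fluxes of those cells, which is the overwrite I have in mind.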

Dear @bitwalkabout

Thank you very much for catching this and for the detailed report. It was very helpful. After a quick inspection, I agree that this is a serious issue and must be fixed.

We need to transfer and set the shared face fields from neighbor MeshBlocks that are already refined before ProlongateInternalField. While it is conceptually straightforward, it is a little bit tricky to implement, as the communication infrastructure is being reconstructed at this point. I have an idea, but I am currently quite busy. Can someone help me if I give directions on how to do it?

@tomidakn thanks for taking a look and responding so quickly!

I am not sure which would be more computationally efficient, but in lieu of changing the code to transfer the field values on the shared face only when refinement in this direction occurs, one could, I think, reconstruct the shared-face values using each cell's other faces. If I understand correctly, this could be done using the face information already exchanged (though perhaps the ordering of the block's refinement and the exchange are out of step in a way that would be hard to circumvent). To ensure exact value duplication on both blocks, the shared face could be reconstructed on both blocks.

Certainly this isn't as clean as sending those extra face values between blocks, but it might be quicker to implement in the version currently in master branch, and then the better solution could be implemented along with other changes for the new comm updates you mentioned?

In the current implementation, MeshBlock B does not have the four shared-face fields from MeshBlock A before refinement, and after refinement (now they are on the same resolution), the shared fields are not transferred between them. So anyway we need to change the communication.

Also, when we change the shared-face fields, we have to recalculate the other face fields of the first active cells in order to satisfy the divergence condition in the active cells.

Oh, I see, before the timestep involving AMR refinement, during each inter-block communication call, restriction takes place on Meshblock A before filling the comm buffer to send to B? I was misremembering, thinking that the restriction was done on the receiving Meshblock B.

If the latter were the case, to clarify what I meant before: though the four shared-face fields are not transferred, the other 16 unique face-field values of those four ghost cells (not shared with block B, but shared among block B's ghost zones) are communicated -- or at least the infrastructure seems to be in place to perform an inter-block communication to fill those values with the new ones from the timestep at whose end this refinement occurs. Those 16 values could be used to reconstruct the currently uncommunicated face values shared between A and B. This reconstruction could be done on both block A and B so that the same reconstructed value is used moving forward. Certainly not as clean as adding the new communication, but potentially a quick fix for the older version of the code.

As you said, if restriction happens before the transfer, then I agree there is no way around adding communication of the extra data, and then updating the first active cells so that the new prolongated values preserve divB in those cells.

Correct, the shared field on A is first restricted and then transferred to B (to save the data size).

I think your suggestion may work, but I am still concerned that we may need to carefully apply the correction when edge and corner neighbors are refined.

My idea is to transfer and set the shared fine fields before ProlongateInternalField during the mesh refinement process, so that we do not have to do the calculation twice.

I spent my weekend on this issue and I think it now works. I pushed it to the amrfix branch. As I am currently very busy, I cannot cover many test cases, so @bitwalkabout, can you try it? I guess the latest commit using unordered_map (hash) should be faster, but I would appreciate it if you could try the older one as well.

First, it identifies pairs of contacting MeshBlocks, one on the coarser level and about to be refined, the other already on the finer level. Then it sends the shared face fields from the finer MeshBlock to the coarser one. On the coarser MeshBlock, it uses the received fields for the shared face instead of the ones prolongated from the coarser level.
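
Schematically, the pairing works on logical locations (a simplified, self-contained illustration, not the exact code on the branch): when a block refines, its logical indices double, so the already-fine blocks touching, say, its upper-x1 face are the four child locations of the coarse neighbor that have zero x1 offset.

#include <cstdint>
#include <vector>

// Simplified stand-in for the logical-location bookkeeping (illustration only).
struct LogicalLoc {
  std::int64_t lx1, lx2, lx3;
  int level;
};

// Enumerate the logical locations of the level+1 blocks touching the upper-x1
// face of a coarse block about to be refined. If all four exist in the current
// block list, the shared face fields should be taken from them (sent fine to
// coarse) instead of being prolongated from the coarse level.
std::vector<LogicalLoc> FineNeighborsAcrossUpperX1(const LogicalLoc &coarse) {
  std::vector<LogicalLoc> out;
  for (int k = 0; k <= 1; ++k) {
    for (int j = 0; j <= 1; ++j) {
      out.push_back({2*(coarse.lx1 + 1),  // first x1 index of the neighbor's children
                     2*coarse.lx2 + j,    // the 2x2 child offsets spanning the face
                     2*coarse.lx3 + k,
                     coarse.level + 1});
    }
  }
  return out;
}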

@jmstone @pgrete @jdolence I would like to draw your attention, because this probably affects AthenaK, AthenaPK and KAthena.

No problem, thank you for your help.

Thanks for the heads-up.

@lroberts36 does this still apply to Parthenon given the ownership model?

@pgrete, I believe Parthenon should not have this problem. During refinement, we prolongate from the old, coarse block to the shared faces of the new fine block. Then we communicate boundaries and faces shared between blocks, assuming that the fine blocks that already exist "own" the shared data. After that, prolongation occurs on faces that were not shared with the coarse block.

IIRC, @bprather ran into a similar (maybe the same) issue when testing out the face field implementation in Parthenon. We had to modify the ownership model slightly to make things work (i.e. giving precedence to pre-existing blocks at a given level, but only during post-refinement communication).

@lroberts36 @pgrete It sounds like Parthenon already took care of this issue.

Great, thanks for confirming @lroberts36
Now we just need to implement CT in AthenaPK ;)

@tomidakn I've found some time to start testing (working with the RT setup first) and there do seem to be remaining issues. Both the third new commit and the first fail after running for a while (maybe a memory de/allocation or MPI comm-buffer allocation error? just a guess). Though the divergence is now fine under most of the circumstances that were originally the issue, there is still divergence showing up -- eventually -- when the fixup is applied. My guess is that the logic of the fix is implemented in a working manner, but there's a small bug that is causing both the runtime error and, on occasion or after a certain number of cycles, breaking the new flux fixes. I'm trying to find out more about where it's failing, and will share anything telling.

OK, I will take a look, but it may take a while as I am very busy right now. I tested only a few timesteps, so the long-term behavior was not tested.

Can you share your test set up?

Test setup is the same as the 3d RT setup I described in my first post, but I have changed drat to 2 so the evolution is a bit quicker, and am currently looking at only one amr refinement over level 0.

I ran valgrind and no memory error was detected for a serial run.

Can you explain how it crashed? Was it an MPI error, segmentation fault or something else? Also I'd appreciate if you could explain under what situation and where the divB error appeared.

@tomidakn I've included some responses below:

Unfortunately, I'm not getting a consistent error with every run. For xorder = 3 with the 3D RT setup, your first commit, running with MPI, the field rotation set to 90 deg, and reflecting boundary conditions (since, as I understand it, your first commit isn't compatible with periodic boundaries?) -- and otherwise the RT setup described just above -- I get this error message upon crash:

[atl1-1-02-017-29-1.pace.gatech.edu:mpi_rank_32][MPIDI_CH3_Abort] Fatal error in MPI_Irecv:
Invalid rank, error stack:
MPI_Irecv(153): MPI_Irecv(buf=0x36f9f30, count=36, MPI_DOUBLE, src=113, tag=1057, MPI_COMM_WORLD, request=0x2c6cda0) failed
MPI_Irecv(98).: Invalid rank has value 113 but must be nonnegative and less than 36
: No such file or directory (2)

In the same setup, but with xorder = 2 and two meshblocks in the second dimension rather than one, it fails on the check I added that pmb is not null, which was of concern to me based on some gdb backtrace results I was getting:

// Step FFC4. Pack and send shared face fields
for (FaceFieldCorrection &f : ffc_send_) {
  MeshBlock *pmb = FindMeshBlock(f.from);
  int p = 0;
  if (pmb == nullptr) {
    std::cerr << "Error: MeshBlock not found for f.from" << std::endl;
    continue; // Skip this iteration
  }

  for (FaceField &var_fc : pmb->vars_fc_) {
    switch (f.face) {
      case BoundaryFace::inner_x1:
        BufferUtility::PackData(var_fc.x1f, f.buf, pmb->is, pmb->is,
        ...

Regarding the divergence -- it now seems to occur only on x1 and x2 faces (at least for this problem setup) (or possibly traveling in that direction along faces from block corners during prolongation?), and it happens along shared block faces when both blocks are newly refined. In other words, when the refinement boundary moves in the x3 direction, extending to include two newly refined blocks that share an x1 or x2 face, there is divergence along that face.

Could you remind me how shared faces between blocks that are both newly refined are guaranteed to be the same? Either that is also a problematic area that needs to be similarly addressed with its own fixup routine, or I've simply forgotten the step where those are handled from when I was going through that part of the code.

The algorithm does not depend on xorder, so let us focus on xorder=2. Also let's focus on the latest commit, and put aside the divergence issue. The MPI issue is more fundamental and critical. However, I could not reproduce your issue with the RT problem even after 2000 steps on my computer using 36 ranks and with periodic boundaries.

By the way, it is of crucial importance to keep all the optimization options (particularly -ipo for intel). Otherwise it can mess up vectorization. If you want to try something conservative, please configure the code with -d option.

@tomidakn I typically notice the problem arise closer to 10k steps, or even a few times that -- this is for the case of the first commit, non-periodic boundaries, etc., that I described running above. Up to that point, aside from the divB issue, there was no apparent MPI issue. Maybe just try running longer. If you get to 50k steps with my small adjustments to the RT setup, including the block sizes, total resolution, and the physical parameters that set the rate at which the refinement levels move, then I'd start to suspect machine differences. Tomorrow I will compile and run on a TACC machine to make sure the MPI issue isn't something local to our cluster at GT.

Regarding the optimizations: using the configure command in my original post and the configure.py file in these commits, no ipo flags are set when using the -mpi flag (compiling with mpicxx). Do you have something different? The other flags I run with, -fp-model=strict and -qno-openmp-simd, came from reading through prior issues opened by others and from my own testing, because I got non-deterministic runtime behavior if I left all the default vectorization and OpenMP flags as-is. I can try adding the -ipo flag, though, and will report back.

I will also try the latest commit again, and report anything I can from running with gdb.

-fp-model=strict -qno-openmp-simd should be fine, I just wanted to make sure that correct compiler options are important.

@tomidakn Were you able to replicate the failure by running that setup longer?

@tomidakn And any others who might be working on the issue:

I have replicated this on Stampede3 with newer Intel compilers. Whether I run with the most recent commit on the amrfix branch or the first, I get the error at ~7000 steps (coarse resolution 36x6x36, with meshblocks of 6x6x6 for the most recent tests) and the same backtrace route, to the same line (backtrace on the master rank):

(gdb) bt
#0 0x00000000005d1a54 in __gnu_cxx::__normal_iterator<std::reference_wrapper<FaceField>*, std::vector<std::reference_wrapper<FaceField>, std::allocator<std::reference_wrapper<FaceField> > > >::__normal_iterator (this=0x7fffffff7578, __i=<optimized out>)
at /opt/apps/gcc/13.2.0/lib/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/stl_iterator.h:1077
#1 0x00000000005cf9b9 in std::vector<std::reference_wrapper<FaceField>, std::allocator<std::reference_wrapper<FaceField> > >::begin (this=0x1e8) at /opt/apps/gcc/13.2.0/lib/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/stl_vector.h:871
#2 0x00000000005cc21d in Mesh::PrepareAndSendFaceFieldCorrection (this=0x97a830, newloc=0x3a7ccf0, ranklist=0x34b1510, newrank=0x3719ed0, nslist=0x97eee0, nbtold=190) at src/mesh/amr_loadbalance.cpp:1397
#3 0x00000000005c8ff1 in Mesh::RedistributeAndRefineMeshBlocks (this=0x97a830, pin=0x976a40, ntot=204) at src/mesh/amr_loadbalance.cpp:551
#4 0x00000000005c6ae9 in Mesh::LoadBalancingAndAdaptiveMeshRefinement (this=0x97a830, pin=0x976a40) at src/mesh/amr_loadbalance.cpp:56
#5 0x00000000004ceb8d in main (argc=5, argv=0x7fffffff83a8) at src/main.cpp:539

If pmb is null, which it seems to be on occasion, then the for loop fails. It seems unlikely that this is consistent with the logic of the new code additions, but I will try adding an exception to skip these portions when FindMeshBlock() assigns no MeshBlock to pmb here. Possibly extraneous info: if I do a restart with a different number of tasks right before the failure, the failure does not happen at the same time but is often delayed.

Without sitting down over Zoom for at least a quick discussion about the design of the new additions, that's about the most I can do at this point, but I will keep trying, since this is holding up a main project line we have at Georgia Tech. I am happy to find a time to meet with anyone available (whatever time is convenient, regardless of timezone). Thanks again for your help!

OK. Somehow I could not reproduce your issue as all the MeshBlocks are derefined to the root level after ~2000 steps. Can you share your input and pgen files?

It sounds like a resource management issue, probably related to MPI.

@tomidakn The two requested files can be found here:

https://drive.google.com/drive/folders/1GKC3KmQWqOjxPW0hMkbJmheC8ro0Kmwy?usp=sharing

Just to double-check: had you run with the athinput file changes I mentioned above? I'm pretty sure that with the default physical setup, default perturbation, and B-field strength, I also got early evolution (at this relatively low resolution) that mixed the contrasting densities at the interface enough that the higher refinement level dropped away. Hence I made a few changes to diminish B-field stabilization of high-k modes, increase the growth rate, and increase the density contrast. Those are included, of course, in the linked files. Also included in the pgen file are the user output additions for the offset divB output variable.

I thought I changed them, but maybe I did not. I use your files.

OK, I found it. The hashmap was not cleared correctly, and the wrong MeshBlocks were identified in certain situations. I only checked that it runs without crashing, so can you test it quantitatively? I hope this also fixes the divergence error.
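
For the record, the failure mode was of this general kind (an illustrative snippet with hypothetical names, not the actual amrfix code): a lookup table reused across AMR cycles has to be rebuilt from scratch, otherwise stale entries from a previous cycle can pair a location with a MeshBlock id (and hence an MPI rank) that no longer exists or now belongs to a different block.

#include <cstdint>
#include <unordered_map>

// Illustration only: a table pairing an encoded logical location with the
// gid of the partner MeshBlock for the shared-face transfer.
class FaceFieldPairTable {
 public:
  void BeginCycle() { table_.clear(); }  // without this, old entries leak into the next AMR step
  void Add(std::uint64_t loc_key, int partner_gid) { table_[loc_key] = partner_gid; }
  int Find(std::uint64_t loc_key) const {  // returns -1 if no partner this cycle
    auto it = table_.find(loc_key);
    return (it == table_.end()) ? -1 : it->second;
  }
 private:
  std::unordered_map<std::uint64_t, int> table_;
};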

I've got an idea to improve the AMR performance a little bit, so I will update it when I have time.

@tomidakn Great! I don't want to jinx it, but a quick repeat of the test I've been running (with two different core counts) is now well past the prior point of failure and still running. Also, the divergence issue I saw before is not present, despite most of the domain having been refined by a level. I think we've got it!

I'll do a more thorough test suite later today, with more levels of refinement and test that it works in the original problematic context of GR, but I wanted to get a quick reply to you.

Great, please keep me updated.

Divergence and the MPI failure from before do seem to be fixed, both in the RT testing setup and in the Kerr-Schild disk problem. However, now that I apply mesh refinement during a restart (i.e. I have a changed refinement criterion upon restart), the newly refined regions contain pixels, somewhat randomly distributed and correlated with low-density (or high bsq/rho, low plasma beta) structures, that seem to be incorrectly reset to the floor. So it seems that either the interpolation of primitives (at least density; I'm checking others) has large errors in some cells, or some cells aren't being filled correctly and the floors are kicking in.

However, this is a restart of an rst file from a slightly older commit of the master branch, using the newest master branch commits. Has backward compatibility been tested?

I'm going to try now doing restart tests (new commit -> new commit and old commit->new commit) in the RT context to see if the issue appears there as well.

I did notice that upon restart, when the refinement occurs, I was not getting the usual output notification associated with large AMR step changes -- 'might have to adjust to new meshblock count', or however it's exactly phrased. Not sure whether this is indicative in a helpful way or not.

I haven't double-checked the ordering yet, but depending on the step at which reconstruction of the face-centered velocities, and thus the associated fluxes, takes place, is that going to have issues similar to the now-solved divB issue? The velocity primitives in the problematic cells during restart are 0 in the dump after the first step following refinement (even now that I've set things up so refinement happens several steps after restart). If the fluxes are way off for the first step after refinement, that could explain why the density is off. At least, that seems consistent with what I'm seeing.

I'm sorry but I do not get your point. Please provide figures and how to reproduce the issue. If it happens only with GR, @c-white knows better than me.

While we have not changed the restart format for a long time, we do not guarantee compatibility between versions.

In the initialization from a restart file, the code does not perform refinement (assuming that refinement is already processed before the end of the previous run). So if you change the refinement condition after restart, it should happen only after the first step.

Athena++ does not solve face-centered velocities as variables. They are just reconstructed from cell-centered variables in the reconstruction step before calling a Riemann solver. So as long as cell-centered variables are correctly refined, they should be fine. Note that a restart file does not contain primitive variables for non-GR runs, and primitives are calculated from conservatives during initialization. For GR runs, primitive variables are loaded from the restart file because GR requires an initial guess for the variable inversion.

During prolongation, either for a meshblock boundary between two refinement levels or for regional refinement with AMR, how is the guess for con-to-prim initialized? I think this is one key part of why I'm getting many failures of inversion and hence floor resets during the refinement process.

Also, somewhat related, I think I'm seeing conflicting uses of cell-center values. For instance, in kerr-schild.cpp in the current master branch, x1s2 = x1s3 = x1v(i) = 0.5 * (r_m + r_p), but in spherical_polar.cpp these values are given by their actual analytic integrals. I'm trying to understand which cell-center locations are used 1) in mesh_refinement.cpp for all the ProlongateXXXXX functions, 2) in field.cpp for CalculateCellCenteredField(), and 3) in the reconstruction routine (PPM in my case). When the field is refined in a prolongated AMR region, I see grid-scale oscillation that would fit with one of these being mismatched. Also, in strong-field, low-density regions (which is where I get most inversion failures), that could help explain what I'm seeing.
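
To make the contrast concrete (standard formulas, not quoted from the code): the arithmetic midpoint is $\tfrac{1}{2}(r_m + r_p)$, whereas the volume-weighted center of a spherical shell cell is

$$\langle r \rangle \;=\; \frac{\int_{r_m}^{r_p} r\, r^2\, dr}{\int_{r_m}^{r_p} r^2\, dr} \;=\; \frac{3}{4}\,\frac{r_p^4 - r_m^4}{r_p^3 - r_m^3},$$

which is the kind of analytic integral I mean; the two differ by a term of order $(\Delta r)^2/r$, so a mismatch between routines could plausibly show up as the grid-scale oscillation I describe.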

One last question, has limiting been tried in CalculateCellCenteredField in the interpolation of the field?

Most of the GR-related parts are written by @c-white, so I am not 100% sure. Very unfortunately, @c-white who wrote the GR part of Athena++ has left the field, so I need to ask you to help yourself.

From the code, AMR does not refine primitive variables, so default values (i.e. all zeros) are used in the conservative-to-primitive conversion in newly refined blocks. So I guess you are right -- this can be why you get failures after refinement. To fix it, we should refine the primitives as well. Maybe it can be done easily, by just adding

if (GENERAL_RELATIVITY)
  pmy_block->pmr->AddToRefinement(&w, &coarse_prim_);

around l.87 in hydro.cpp, but I am not sure if it will work. @felker, what do you think?
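
In context, the suggestion would sit next to the existing enrollment of the conserved variables, roughly like this (a sketch; the surrounding lines are quoted from memory, not verified against l.87):

// In the Hydro constructor (hydro.cpp): the conserved variables are already
// enrolled for S/AMR; the proposed line additionally enrolls the primitives in
// GR, so that newly refined MeshBlocks start the variable inversion from a
// prolongated guess rather than from all-zero defaults.
if (pmy_block->pmy_mesh->multilevel) {
  refinement_idx = pmy_block->pmr->AddToRefinement(&u, &coarse_cons_);
  if (GENERAL_RELATIVITY)
    pmy_block->pmr->AddToRefinement(&w, &coarse_prim_);
}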

Regarding the cell-centered position, I guess spherical-polar like coordinates should use the volume averaged center position, but there might be some intention behind it. I am sorry but I simply do not know. If the solution is improved by changing it, I am happy to change it.

In the second-order scheme, the cell-centered field should reside somewhere between the left and right face fields, so there is no need for limiting. Maybe we need something for higher-order schemes, but our CT solver is only second order at this moment.
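
That is, with the second-order interpolation the cell-centered component is a convex combination of the two bounding face values, schematically

$$ b^{c}_{1,i} \;=\; w_i\, B_{1,i-\frac{1}{2}} \;+\; (1 - w_i)\, B_{1,i+\frac{1}{2}}, \qquad 0 \le w_i \le 1 $$

(the exact coordinate-dependent weights used in CalculateCellCenteredField are not reproduced here), so it cannot overshoot or undershoot the face values and a limiter would have nothing to do.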

SMR does not have this issue, so if you can run your simulation with SMR, perhaps it is a good workaround.