NOAA-GFDL/MOM6

Oblique OBCs have problematic restarts in some cases

Closed this issue · 23 comments

With oblique open boundary conditions, both u-face (east-west) and v-face (north-south) segments have both segment%rx_norm_obl and segment%ry_norm_obl fields. In order to have reproducing restarts, these are copied into the 3-d arrays OBC%rx_oblique and OBC%ry_oblique. Unfortunately, OBC segments on the two faces can have the same global indices, so in cases where active east-west and north-south segments join at the north-east corner of a tracer cell, these restart fields would overwrite each other. A similar situation applies to the OBC%cff_normal fields that are also used for oblique OBCs.

The appropriate solution here would seem to be to use separate restart fields for the u-face and v-face segments of oblique OBCs. This would add three more restart fields when oblique OBCs are used, but the staggering of all variables would then be unambiguous, and cases with pairs of oblique OBCs joining at the northeast corner of a tracer cell would reproduce across restarts.
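As a toy illustration of the overwriting described above (this is not MOM6 code; the array shapes, indices, and values are all invented for the example), consider a u-face and a v-face segment whose oblique-radiation values land at the same point of a single shared array, compared with keeping separate u-face and v-face arrays:

    # Toy example only -- none of this is MOM6 code.  A u-face and a v-face OBC
    # segment meet at the north-east corner of a tracer cell, so their oblique
    # restart values map to the same (i,j) of a shared array.
    import numpy as np

    ni, nj, nk = 6, 6, 1
    i_c, j_c = 4, 4                  # indices shared by the two segments (invented)

    rx_from_u_segment = 0.25         # value computed on the east-west (u-face) segment
    rx_from_v_segment = 0.75         # value computed on the north-south (v-face) segment

    # Current scheme: one shared restart array for both segment orientations.
    rx_shared = np.zeros((ni, nj, nk))
    rx_shared[i_c, j_c, 0] = rx_from_u_segment
    rx_shared[i_c, j_c, 0] = rx_from_v_segment    # clobbers the u-face value
    print("shared array retains only:", rx_shared[i_c, j_c, 0])

    # Proposed scheme: separate u-face and v-face restart arrays, so both survive.
    rx_oblique_u = np.zeros((ni, nj, nk))
    rx_oblique_v = np.zeros((ni, nj, nk))
    rx_oblique_u[i_c, j_c, 0] = rx_from_u_segment
    rx_oblique_v[i_c, j_c, 0] = rx_from_v_segment
    print("separate arrays retain:", rx_oblique_u[i_c, j_c, 0], rx_oblique_v[i_c, j_c, 0])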

I do not have any examples where this situation arises, but that may be because oblique OBCs are not very widely used.

Note that normal radiation open boundary conditions do not have this problem.

@kshedstrom or @MJHarrison-GFDL, I would particularly appreciate your thoughts on whether this diagnosis seems correct.

Yes, that seems correct to me. I might have even known it when we implemented it, just accepting it as being better than not saving any of that stuff at all.

The Northwest Atlantic (NWA) experiment setups of @andrew-c-ross use oblique OBCs, and they have restart issues that we have not been able to fix. Can I replace OBLIQUE with something else just to test whether the restart issues are related to this?

OBC_SEGMENT_001 = "J=0,I=0:N,FLATHER,OBLIQUE,NUDGED" 

@nikizadehgfdl Yes, try ORLANSKI instead.
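For example, the segment specification above would become:

OBC_SEGMENT_001 = "J=0,I=0:N,FLATHER,ORLANSKI,NUDGED"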

I may have a fix for this problem at github.com/Hallberg-NOAA/MOM6/tree/oblique_OBC_restart_fix, but because the MOM6-examples test suite does not include any cases that exhibit this bug, I would appreciate it if someone else could test this code fix against a case that does exhibit the restart issues with oblique OBCs.

@nikizadehgfdl, or @andrew-c-ross, or @kshedstrom, would you be willing to test this for me? The restart file contents have changed, so the old restart files will not work. Instead I am just looking for a confirmation that two short runs with oblique OBCs give an equivalent answer to one longer run, with all of these run segments using the new code.
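For anyone testing this, a minimal sketch of the kind of comparison I mean is below; it is not part of MOM6, the file paths are assumptions, and it assumes the netCDF4 Python package. Any tool that compares every field of the final restart files bitwise serves the same purpose.

    # Sketch only: compare every common variable in the final restart file from a
    # single long run against the one from a run that was stopped and restarted.
    import numpy as np
    from netCDF4 import Dataset

    def compare_restarts(file_a, file_b):
        """Report any variable whose values differ between two restart files."""
        with Dataset(file_a) as a, Dataset(file_b) as b:
            a.set_auto_mask(False)
            b.set_auto_mask(False)
            identical = True
            for name in sorted(set(a.variables) & set(b.variables)):
                va, vb = a.variables[name][...], b.variables[name][...]
                if va.shape != vb.shape or not np.array_equal(va, vb):
                    print("DIFFER :", name)
                    identical = False
            return identical

    # Hypothetical directory names for the one-segment and two-segment runs.
    if compare_restarts("long_run/RESTART/MOM.res.nc", "two_segments/RESTART/MOM.res.nc"):
        print("Restart files match.")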

I can test this in the Bering domain, but will have to rerun it with OBLIQUE instead of ORLANSKI. Do you want confirmation that the unmodified code has a restart problem?

The dumbbell tests with OBLIQUE should be fine, having only EW OBCs (or NS if rotated).

I will test NWA and report back ASAP.

I'm having some trouble testing this, because it also fails to reproduce when I switch to ORLANSKI. With Orlanski, a 2-day run compared against a 1-day/restart/1-day pair of runs fails with:

      Comparing ice_model.res.nc...
DIFFER : VARIABLE : flux_u : ATTRIBUTE : checksum : VALUES : "4D9242C5DB2789AB" <> "4E10805449F28086"
      Comparing input.nml...
WARNING : NONESSENTIAL FILES DIFFER : input.nml
      Comparing MOM.res_1.nc...
DIFFER : VARIABLE : diffu : ATTRIBUTE : checksum : VALUES : "B093914D35D8782B" <> "9E2943536525AFC8"
      Comparing MOM.res.nc...
DIFFER : VARIABLE : Temp : ATTRIBUTE : checksum : VALUES : "94D9EE69B4E1CCA4" <> "9157B520BEDF7165"

The same variables fail with OBLIQUE. With OBLIQUE and this new code, tres_y_001 (south boundary) also doesn't match; with the old code tres_y_002 (north boundary) doesn't match.

I tried setting the tracer reservoir length scale to 0 since that has historically been troublesome, but it didn't change the failures. I will try turning off some other options.

Comparing the restart files can tell you whether runs are identical, but because differences in one field can spread quickly to all others, such a comparison does not help much in tracking down where in the code the differences emerge. For that, we set DEBUG=True, which writes out checksum values for many different variables at various points in each timestep. The first point where the output diverges is usually a good indicator of where in the code the differences arise. Moreover, it is straightforward to (perhaps temporarily) add other checksum calls to the code to further narrow down where the differences arise.

(Be warned, though, that setting DEBUG=True can lead to a very large volume of output, so it is best carried out on a short run with all the output to stdout piped to a file.)

When debugging a run with restarts, it may be helpful to append the output of the second segment to the output from the first, after removing some of the initialization-related output from the second run. This can then be compared with the output of a single run that covers the entire period in the two run segments.
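As a concrete (if minimal) sketch of that comparison, the snippet below reports the first line at which two such stdout logs diverge; it is not part of MOM6, and the log-file names are assumptions:

    # Sketch only: find the first differing line between the DEBUG=True stdout of
    # a single long run and the concatenated, trimmed stdout of two run segments.
    import itertools

    def first_divergence(log_a, log_b):
        with open(log_a) as fa, open(log_b) as fb:
            pairs = itertools.zip_longest(fa, fb, fillvalue="")
            for n, (la, lb) in enumerate(pairs, start=1):
                if la != lb:
                    print("First difference at line", n)
                    print("  " + log_a + ": " + la.rstrip())
                    print("  " + log_b + ": " + lb.rstrip())
                    return n
        print("Logs are identical.")
        return None

    first_divergence("single_run.stdout", "two_segments.stdout")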

I can confirm not getting identical solutions with OBLIQUE with your branch. It differs from the get-go in:
u Beginning of step_MOM [uv]
but not v Beginning of step_MOM [uv]

The h fields match in the center but not at the corners; that has always been the case, though, with differing nonsense values in the outside halo region.

Which test cases are demonstrating these inconsistencies (I am assuming the pan-Arctic and NWAtlantic configurations), and are there any relatively small test cases that would be good candidates for debugging these problems?

Also, just to narrow down what is going on, would it be possible to try short runs with these configurations with all of the OBC segments save one disabled, to determine whether this is a problem with individual OBC segments or whether this only arises with interactions between OBC segments? (Closing off OBC segments might lead to big adjustments, but hopefully the model will be able to hold together, perhaps with a shortened timestep, long enough to test the reproducibility of the restarts.)
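For instance, assuming the segments are listed in MOM_input as in the example above, this might be as simple as reducing OBC_NUMBER_OF_SEGMENTS and keeping only the segment being tested (illustrative values only):

OBC_NUMBER_OF_SEGMENTS = 1
OBC_SEGMENT_001 = "J=0,I=0:N,FLATHER,OBLIQUE,NUDGED"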

Here's a diff using the NWA12 domain and Orlanski (FLATHER,ORLANSKI,NUDGED), and tracer reservoir lengths set to 0: https://gist.github.com/andrew-c-ross/d4bba834b26f222b70986d5ad3cba920

I think NWA12 can run without open boundaries without too many problems, so I should be able to run experiments with that. However, it may take me some time to do because of a possible biogeochemical issue I'm currently working on.

I was actually trying it with the Bering domain. All the Bering files are on gaea if you switch back to your JRA forcing. Sorry, I mostly run it from a January restart file, but I've started the transfer of 2011 files.

/lustre/f2/dev/gfdl/Katherine.Hedstrom/ESMG/ESMG-configs/Bering

What I saw last night in the debugger is that it somehow only copied over 3 of the 4 halo points at a tile boundary in u; I have no idea why, and I only checked u.

Gaea is now set up to run Bering for real. I failed to run dueling debuggers there, though; it seems I can have one 300-core license, but not two 12-core licenses. The DEBUG output shows again that it fails to match from the very beginning of the restart run. Why would only 3 of the 4 halo points have non-zero values? Where is that set? Would that cause the checksums to disagree?

Turning off OBLIQUE_TAN changes how the runs diverge. Now they diverge at u pre-btstep accel PF[uv], in both u and v, with u not matching on the eastern side only and v on the northern side only.

The pre-btstep accel was a red herring. u and v match at the end of the first go through step_mom_dyn_split_rk2, diverge in the second call at u before vertvisc: up.

Turning on DEBUG_OBC causes the restart to match. The diff between with and without DEBUG_OBC shows up at:
h-point: c= 138662915 pre-btstep accel pbce
the second time around. DEBUG_OBC sets the value of h just outside the OBCs to zero, and also sets the pressure gradients to zero right on the boundary. Could it be that something depends on the value of pbce outside the domain?

@Hallberg-NOAA I made a PR to your branch to fix btstep. However, there's still a dependence on h outside the domain in MOM_mixed_layer_restrat.F90. You can compare with and without DEBUG_OBC to see. In the Bering, things change at time 4200 s, with h(:,5,1) being different everywhere after the call, not just outside the domain, not just at the edges.

@kshedstrom, I merged in your PR and added some more changes to github.com/Hallberg-NOAA/MOM6/tree/oblique_OBC_restart_fix that I think address the changes to the Bering when DEBUG_OBC is enabled, but I would appreciate it if you would verify this in your version of the Bering test case. (I am using an older coupler, and had to drop some of the more recent coupler.nml options.)

Yes, this does pass the perfect restart test for OBLIQUE obcs. However, there is still work to be done on OBLIQUE_TAN.

Thanks for this update, @kshedstrom. After thoroughly inspecting the OBLIQUE_TAN related code and trying (and failing) to spot what might be going wrong with the OBLIQUE_TAN OBCs, there are three questions that I think might be instructive:

  1. If we set OBC_RAD_VEL_WT = 1.0, do cases with OBLIQUE_TAN OBCs reproduce across restarts?

  2. Do the cases with OBLIQUE_TAN OBCs reproduce across different PE counts and layouts?

  3. Do you think that the improvements accruing from this branch as it stands are great enough to warrant a PR to dev/gfdl and hence to main at this time, or do you think that we should hold off until we have corrected all of the problems with OBLIQUE_TAN OBCs as well?

1., 2. Yes, with OBC_RAD_VEL_WT = 1.0, it does reproduce across restarts and across process counts.

  3. I don't see any reason to wait to push your success so far.

Sorry, my test does now reproduce with OBLIQUE_TAN. I must have done something wrong in that other test.

Thank you so much for that update, @kshedstrom. The fact that PR #219 corrects the restarts with OBLIQUE_TAN as well is a tremendous relief. With this update, I think that it is clear that this issue can be closed once PR #219 is merged into the dev/gfdl branch of the MOM6 code base.