Broken diagnostics (T_adx, T_ady, S_adx, S_ady) in symmetric memory mode

Question

Broken diagnostics (T_adx, T_ady, S_adx, S_ady) in symmetric memory mode

Closed this issue 4 months ago · 8 comments

Diagnostics including T_adx, T_ady, S_adx, and S_ady give stripes of no values on CPU boundaries in symmetric memory mode (vertical stripes in X_adx and horizontal stripes in X_ady). The stripes are not present in non-symmetric memory. We have not yet confirmed if it impacts other diagnostics. It does not impact the prognostic model solutions. The stripes emerge in code versions from PR #564 and later.

Answer 1 · 2024-05-21T17:57:08.000Z

Not sure (haven't tried) but it looks like the shift calculation was dropped at https://github.com/NOAA-GFDL/MOM6/pull/564/files#diff-b4d8d593199025815b566920316560d920d617186e26b3464e2a8b2f17176ad9L388

Answer 2 · 2024-05-21T18:35:58.000Z

Apparently (slack thread) this is not apparent in other variables so the newer loop structure in MOM_diag_remap.F90 is probably OK. We'll have to do a DEBUG=True for before and after...

Answer 3 · 2024-05-23T14:52:34.000Z

It looks like it was this commit that introduced the striping: 6153d97
I reverted this change:
if (CS%usePPM) stencil = 3
back to this:
if (CS%usePPM .and. .not. CS%useHuynh) stencil = 3
and the striping went away. (So when the stencil is 2 there is no striping.)

Still unsure what the fix should be.

Answer 4 · 2024-05-23T19:59:56.000Z

I have now figured out that I can get rid of the striping with the newer version of the code with 2 changes:
A) increasing the halo size in OM4 from 4 to 6
or
B) Changing this line to isv = is-1-nsten_halo * stencil ; jsv = js-1-nsten_halo * stencil

(A works independent of B and B works independent of A)

I'm not sure if "B" is the correct patch for this yet, but just offering these as clues. I wonder if @Hallberg-NOAA or @kshedstrom might have thoughts?

Answer 5 · 2024-05-23T20:53:10.000Z

Making the halos or the work space wider is a good clue, but it will be a potentially expensive (and in the case of the wider halos unreliable) fix for this problem. I think that the underlying problem here is with how the do_i(i,j) is being declared, set and used. In fact, I think that you already identified this problem, @breichl , when you added the comment on Line 1069 of MOM_tracer_advect.F90, when you say:
! (The logical test could be "(do_i(i,j) .or. do_i(i+1,j))" to be clearer, but not needed)

To fix the problem, I suspect that the logic surrounding the calculation of the tracer advection diagnostics should be revised to do i=is,ie ; if (do_i(i,j) .or. do_i(i,j+1)) then, in both lines 1070 and 1078 and do I=is-1,ie ; if (do_i(i,j) .or. do_i(i+1,j)) then on lines 669 and 691.

In addition, I believe that the declarations for do_i on lines 383 and 759 should be changed from logical :: do_i(SZIB_(G), SZJ_(G)) to logical :: do_i(SZI_(G), SZJ_(G)) to reflect that these logical branches apply at tracer points.

In addition, I suspect that @marshallward might advise us that it would just be faster to avoid the logical tests around these diagnostics of net advective fluxes and exploit the fact that we don't change anything by adding in 0-fluxes. (We do need the logical tests for the tracer advection itself at about lines 653 and 1041, because even when the fluxes are 0, multiplying a tracer concentration by a thickness and then the reciprocal of that thickness changes the tracer concentration at roundoff.)

Answer 6 · 2024-05-23T23:23:57.000Z

Thanks @Hallberg-NOAA! I tested this again with the revised logic and indeed that seems to also fix the striping on its own. Pr is in #649

Another thing I noticed while debugging this is the inconsistency between the logic change at L127 with the existing logic at L396 and L768
In my configuration I can update these three lines to all have the updated logic and get the same results. It is separate from the present issue, but just wanted to point it out.

Answer 7 · 2024-05-24T07:03:21.000Z

@breichl, now that PR #649 has been merged into dev/gfdl, do you think that this issue can be closed, or would you like this to be kept open as a reminder to deal with the inconsistent logic you noted where the halo-sizes are being set?

Answer 8 · 2024-05-24T11:37:15.000Z

I'm fine to close it now, the bug that was impacting OM5 is now resolved. It seems we have a couple things to potentially revisit later in this code w/ the other being the logicals inside the loop setting the diagnostics.