NOAA-GFDL/MOM6

cgrid based experiments have a restart issue under Intel avx2

Opened this issue · 5 comments

When we compile MOM6-SIS2 with the AVX2 instruction set, e.g. using the compiler switch -march=core-avx2 instead of -xsse2 on c4 with Intel21, MOM6_SIS2_cgrid does not reproduce across a restart (the 1x2day answer != the 2x1day answer).
MOM6_SIS2 (bgrid) has no such issue.

I would guess that mom-ocean#1331 is also related to this issue.

If you were to do this with DEBUG=True in MOM_override and SIS_override, would the problem just go away, or could we use this runtime parameter to track down where in the code the problem arises?
(Note that the setting for DEBUG is not supposed to change answers at all, but optimizing compilers can sometimes do unforeseen things.)
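
As a general illustration of how a change in optimization can alter answers at the bit level: floating-point addition is not associative, so any transformation that changes the order of operations can change the last bits of a result. The sketch below shows only that general fact in Python. It is not a diagnosis of this issue, and nothing in this thread establishes that operation reordering is the cause here.

```python
# Sum the same three values with two different groupings.
# In exact arithmetic the results would be identical; in IEEE double
# precision the intermediate roundings differ.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)

print(grouped_left == grouped_right)
print(repr(grouped_left), repr(grouped_right))
```

A one-bit difference like this is enough to change a checksum, which is why bitwise restart reproducibility is sensitive to compiler flags.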

I tried with DEBUG=True and the restart issue persisted (and as you noted it did not change answers).
Here are the stdouts of the 1x2d and 2x1d runs for MOM6_SIS2_cgrid:

/lustre/f2/scratch/Niki.Zadeh/FMS2022.04_mom6_20220922_0/MOM6_SIS2_cgrid/ncrc5.intel22_avx2-prod/stdout/run/MOM6_SIS2_cgrid_1x0m2d_64x1o1.o134324168 
/lustre/f2/scratch/Niki.Zadeh/FMS2022.04_mom6_20220922_0/MOM6_SIS2_cgrid/ncrc5.intel22_avx2-prod/stdout/run/MOM6_SIS2_cgrid_2x0m1d_64x1o1.o134324169

I am never sure how to interpret these DEBUG printouts, but I think they indicate divergence at the first timestep after restart in forces%tau[xy] (the mean/min/max lines match, but the c=, W= and S= values differ):

1x2day

u-point: mean=  -4.1938460052178765E-04 min=  -5.7441462892834971E-01 max=   1.0057015592517695E+00 u Before steps forces%tau[xy]
u-point: c=   1695533 W=   1695533 u Before steps forces%tau[xy]
v-point: mean=  -4.7573751675404367E-03 min=  -5.7978682712943574E-01 max=   5.1548970973319852E-01 v Before steps forces%tau[xy]
v-point: c=   1690788 S=   1681810 v Before steps forces%tau[xy]
h-point: mean=   8.8362561189375114E+01 min=   0.0000000000000000E+00 max=   6.4088225739349673E+03 Before steps forces%p_surf
h-point: c=    219427 Before steps forces%p_surf
h-point: mean=   5.7385865633417070E-03 min=   0.0000000000000000E+00 max=   3.1249700161192415E-02 Before steps forces%ustar
h-point: c=   1724581 Before steps forces%ustar

2x1day

u-point: mean=  -4.1938460052178765E-04 min=  -5.7441462892834971E-01 max=   1.0057015592517695E+00 u Before steps forces%tau[xy]
u-point: c=   1695523 W=   1695523 u Before steps forces%tau[xy]
v-point: mean=  -4.7573751675404367E-03 min=  -5.7978682712943574E-01 max=   5.1548970973319852E-01 v Before steps forces%tau[xy]
v-point: c=   1690769 S=   1681763 v Before steps forces%tau[xy]
h-point: mean=   8.8362561189375114E+01 min=   0.0000000000000000E+00 max=   6.4088225739349673E+03 Before steps forces%p_surf
h-point: c=    219422 Before steps forces%p_surf
h-point: mean=   5.7385865633417070E-03 min=   0.0000000000000000E+00 max=   3.1249700161192415E-02 Before steps forces%ustar
h-point: c=   1724564 Before steps forces%ustar

You could try the following to diff the relevant portions of the stdouts:

vimdiff /ncrc/home2/Niki.Zadeh/MOM6_SIS2_cgrid_1x.out /ncrc/home2/Niki.Zadeh/MOM6_SIS2_cgrid_2x.out
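
If scrolling through vimdiff gets tedious, the first diverging DEBUG line can also be located with a small script. The following is a minimal sketch, assuming the DEBUG lines all start with a `<letter>-point:` prefix as in the excerpts above. The function names are hypothetical and the sample text is copied from the 1x2day and 2x1day excerpts in this thread; in practice you would read the two stdout files instead.

```python
import re
from itertools import zip_longest

# Assumption: DEBUG statistics/checksum lines look like
#   "u-point: c=   1695533 W=   1695533 u Before steps forces%tau[xy]"
POINT_LINE = re.compile(r"^\s*\w-point:")

def debug_lines(text):
    """Return the DEBUG lines of a stdout dump with whitespace normalized."""
    return [" ".join(line.split())
            for line in text.splitlines() if POINT_LINE.match(line)]

def first_divergence(text_a, text_b):
    """Return (index, line_a, line_b) for the first differing DEBUG line, else None."""
    pairs = zip_longest(debug_lines(text_a), debug_lines(text_b))
    for index, (line_a, line_b) in enumerate(pairs):
        if line_a != line_b:
            return index, line_a, line_b
    return None

# Sample lines taken from the two excerpts quoted in this thread.
run_1x2d = """
u-point: mean=  -4.1938460052178765E-04 min=  -5.7441462892834971E-01 max=   1.0057015592517695E+00 u Before steps forces%tau[xy]
u-point: c=   1695533 W=   1695533 u Before steps forces%tau[xy]
"""
run_2x1d = """
u-point: mean=  -4.1938460052178765E-04 min=  -5.7441462892834971E-01 max=   1.0057015592517695E+00 u Before steps forces%tau[xy]
u-point: c=   1695523 W=   1695523 u Before steps forces%tau[xy]
"""

result = first_divergence(run_1x2d, run_2x1d)
print(result)
```

Only lines matching the prefix are compared, so timing output and other run-specific noise in the stdout files does not produce false differences.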

Looks like that checksum is here:

if (do_dyn) call MOM_mech_forcing_chksum("Before steps", forces, G, US, haloshift=0)

You could go inside MOM_mech_forcing_chksum for the details, but probably not necessary.

Usually I would keep adding uvchksum() calls to the earlier lines and check the output to find the exact line where the difference first appears. It can be very tedious, so if you have some intuition about where it happens, that can help.

I guess the first thing is to check if it's happening outside of step_MOM.

I just noticed that the same restart issue happens with the gcc compiler at -O3 (both gcc9 on c4 and gcc11 on c5)! -O2 does not have any restart issue.