cgrid based experiments have a restart issue under Intel avx2
When we compile MOM6-SIS2 with the AVX2 instruction set, e.g. using the compiler switch -march=core-avx2 instead of -xsse2 on c4 with Intel21, MOM6_SIS2_cgrid does not reproduce across a restart (1x2days answer != 2x1day answer).
MOM6_SIS2 (bgrid) has no such issue.
I would guess that mom-ocean#1331 is also related to this issue.
If you were to do this with DEBUG=True in MOM_override and SIS_override, would the problem just go away, or could we use this runtime parameter to track down where in the code the problem arises?
(Note that the setting for DEBUG is not supposed to change answers at all, but optimizing compilers can sometimes do unforeseen things.)
I tried with DEBUG=True and the restart issue persisted (and as you noted it did not change answers).
Here are the stdouts of the 1x2d and 2x1d runs for MOM6_SIS2_cgrid:
/lustre/f2/scratch/Niki.Zadeh/FMS2022.04_mom6_20220922_0/MOM6_SIS2_cgrid/ncrc5.intel22_avx2-prod/stdout/run/MOM6_SIS2_cgrid_1x0m2d_64x1o1.o134324168
/lustre/f2/scratch/Niki.Zadeh/FMS2022.04_mom6_20220922_0/MOM6_SIS2_cgrid/ncrc5.intel22_avx2-prod/stdout/run/MOM6_SIS2_cgrid_2x0m1d_64x1o1.o134324169
I am never sure how to interpret these DEBUG printouts, but I think they indicate divergence at the first timestep after restart in forces%tau[xy]:
1x2day
u-point: mean= -4.1938460052178765E-04 min= -5.7441462892834971E-01 max= 1.0057015592517695E+00 u Before steps forces%tau[xy]
u-point: c= 1695533 W= 1695533 u Before steps forces%tau[xy]
v-point: mean= -4.7573751675404367E-03 min= -5.7978682712943574E-01 max= 5.1548970973319852E-01 v Before steps forces%tau[xy]
v-point: c= 1690788 S= 1681810 v Before steps forces%tau[xy]
h-point: mean= 8.8362561189375114E+01 min= 0.0000000000000000E+00 max= 6.4088225739349673E+03 Before steps forces%p_surf
h-point: c= 219427 Before steps forces%p_surf
h-point: mean= 5.7385865633417070E-03 min= 0.0000000000000000E+00 max= 3.1249700161192415E-02 Before steps forces%ustar
h-point: c= 1724581 Before steps forces%ustar
2x1day
u-point: mean= -4.1938460052178765E-04 min= -5.7441462892834971E-01 max= 1.0057015592517695E+00 u Before steps forces%tau[xy]
u-point: c= 1695523 W= 1695523 u Before steps forces%tau[xy]
v-point: mean= -4.7573751675404367E-03 min= -5.7978682712943574E-01 max= 5.1548970973319852E-01 v Before steps forces%tau[xy]
v-point: c= 1690769 S= 1681763 v Before steps forces%tau[xy]
h-point: mean= 8.8362561189375114E+01 min= 0.0000000000000000E+00 max= 6.4088225739349673E+03 Before steps forces%p_surf
h-point: c= 219422 Before steps forces%p_surf
h-point: mean= 5.7385865633417070E-03 min= 0.0000000000000000E+00 max= 3.1249700161192415E-02 Before steps forces%ustar
h-point: c= 1724564 Before steps forces%ustar
You could try the following to diff the relevant portions of the stdouts:
vimdiff /ncrc/home2/Niki.Zadeh/MOM6_SIS2_cgrid_1x.out /ncrc/home2/Niki.Zadeh/MOM6_SIS2_cgrid_2x.out
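As an alternative to eyeballing the vimdiff, here is a minimal Python sketch that reports the first DEBUG checksum line that differs between two stdout excerpts. It assumes both inputs have already been trimmed to start at the same model time (as the two files above are meant to be) and that checksum lines begin with "u-point:", "v-point:", "h-point:" or "q-point:". The function names are hypothetical and not part of MOM6.

```python
import re
from itertools import zip_longest

# Assumed shape of a DEBUG checksum line, e.g.
# "u-point: c= 1695533 W= 1695533 u Before steps forces%tau[xy]"
CHKSUM_RE = re.compile(r"^\s*[uvhq]-point:")


def checksum_lines(text):
    """Return only the checksum lines, with runs of whitespace collapsed."""
    return [" ".join(line.split())
            for line in text.splitlines()
            if CHKSUM_RE.match(line)]


def first_divergence(text_a, text_b):
    """Return (index, line_a, line_b) for the first differing checksum line.

    Returns None if both checksum sequences are identical. If one sequence
    is shorter, the missing side is reported as None.
    """
    lines_a = checksum_lines(text_a)
    lines_b = checksum_lines(text_b)
    for i, (la, lb) in enumerate(zip_longest(lines_a, lines_b)):
        if la != lb:
            return i, la, lb
    return None


def first_divergence_in_files(path_a, path_b):
    """Convenience wrapper that reads two stdout files from disk."""
    with open(path_a) as fa, open(path_b) as fb:
        return first_divergence(fa.read(), fb.read())
```

Calling first_divergence_in_files() on the two .out files above would return the index and text of the first mismatching checksum line. For the excerpts quoted earlier, that would be the "u-point: c= ..." line for forces%tau[xy], since the mean/min/max lines agree and only the c= values differ.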
Looks like that checksum is here:
Line 722 in d46de87
You could go inside MOM_mech_forcing_chksum for the details, but that is probably not necessary.
Usually I would keep adding uvchksum() calls to the earlier lines and check the output to find the exact line where the difference happens. It can be very tedious, so if you have some intuition about where it happens, that can help.
I guess the first thing is to check if it's happening outside of step_MOM.
I just noticed that the same restart issue happens under the gcc compiler with -O3 (both gcc9 on c4 and gcc11 on c5)! -O2 does not have any restart issue.