OMIPp25+BLING and CM4 crash with Overflow in reproducing_EFP_sum(_2d)
After updating MOM6-examples from commit 40e3937 (on 20231130) to commit ab0c120 (on 20240321), the regression test experiment OMIP_CORE2 (which has BLING on) crashes as follows:
FATAL from PE 125: Overflow in reproducing_EFP_sum(_2d) conversion of 9.56361E+43
Image PC Routine Line Source
fms_MOM6_SIS2_com 0000000001F4A9A7 mpp_mod_mp_mpp_er 72 mpp_util_mpi.inc
fms_MOM6_SIS2_com 000000000097B4C8 mom_error_handler 191 MOM_error_handler.F90
fms_MOM6_SIS2_com 00000000009E8921 mom_coms_mp_repro 203 MOM_coms.F90
fms_MOM6_SIS2_com 0000000000C66F23 mom_spatial_means 391 MOM_spatial_means.F90
fms_MOM6_SIS2_com 0000000000B1F851 mom_generic_trace 689 MOM_generic_tracer.F90
fms_MOM6_SIS2_com 00000000009EF10C mom_tracer_flow_c 725 MOM_tracer_flow_control.F90
fms_MOM6_SIS2_com 000000000104B38B mom_sum_output_mp 530 MOM_sum_output.F90
fms_MOM6_SIS2_com 0000000000BACBF0 mom_mp_finish_mom 3431 MOM.F90
fms_MOM6_SIS2_com 00000000009D057E ocean_model_mod_m 572 ocean_model_MOM.F90
fms_MOM6_SIS2_com 000000000041499F MAIN__ 1063 coupler_main.F90
For some layouts, it crashes like:
Nan!
fms_MOM6_SIS2_com 0000000001F3D5FA mpp_mod_mp_mpp_mi 32 mpp_reduce_mpi.fh
fms_MOM6_SIS2_com 0000000000CE0EC9 mom_horizontal_re 86 MOM_horizontal_regridding.F90
fms_MOM6_SIS2_com 00000000010461D5 mom_tracer_initia 220 MOM_tracer_initialization_from_Z.F90
fms_MOM6_SIS2_com 0000000000B27EA5 mom_generic_trace 354 MOM_generic_tracer.F90
fms_MOM6_SIS2_com 00000000009F09E0 mom_tracer_flow_c 343 MOM_tracer_flow_control.F90
fms_MOM6_SIS2_com 0000000000BBC705 mom_mp_initialize 3323 MOM.F90
which comes from the "stop" statement in
https://github.com/NOAA-GFDL/MOM6/blob/dev/gfdl/src/framework/MOM_horizontal_regridding.F90#L74
Running in debug mode (-O0) gives division by 0 and the following traceback.
forrtl: error (73): floating divide by zero
Image PC Routine Line Source
libpthread-2.31.s 000014F967871910 Unknown Unknown Unknown
fms_MOM6_SIS2_com 00000000019A5FDA mom_remapping_mp_ 754 MOM_remapping.F90
fms_MOM6_SIS2_com 000000000198F533 mom_remapping_mp_ 195 MOM_remapping.F90
fms_MOM6_SIS2_com 0000000003423DDC mom_ale_mp_ale_re 1335 MOM_ALE.F90
fms_MOM6_SIS2_com 0000000001CCB578 mom_tracer_initia 204 MOM_tracer_initialization_from_Z.F90
fms_MOM6_SIS2_com 0000000001B3728D mom_generic_trace 354 MOM_generic_tracer.F90
fms_MOM6_SIS2_com 00000000020D3730 mom_tracer_flow_c 343 MOM_tracer_flow_control.F90
fms_MOM6_SIS2_com 0000000002016C03 mom_mp_initialize 3323 MOM.F90
fms_MOM6_SIS2_com 0000000001A265B7 ocean_model_mod_m 278 ocean_model_MOM.F90
The experiment runs fine when I turn off generic tracer BLING.
I think the reason is that hSrc is never computed before it is used as the source thickness in the h_is_in_Z_units branch. The source of the issue is here:
The possible solution should be:
    if (h_is_in_Z_units) then
      dz_neglect = set_dz_neglect(GV, US, remap_answer_date, dz_neglect_edge)
      !added to compute the hSrc
      GV_loc = GV ; GV_loc%ke = kd
      call dz_to_thickness_simple(dzSrc, hSrc, G, GV_loc, US)
      !finish adding
      call ALE_remap_scalar(remapCS, G, GV, kd, hSrc, tr_z, h, tr, all_cells=.false., answer_date=remap_answer_date, &
                            H_neglect=dz_neglect, H_neglect_edge=dz_neglect_edge)
    else
      ! Equation of state data is not available, so a simpler rescaling will have to suffice,
      ! but it might be problematic in non-Boussinesq mode.
      GV_loc = GV ; GV_loc%ke = kd
      call dz_to_thickness_simple(dzSrc, hSrc, G, GV_loc, US)
      call ALE_remap_scalar(remapCS, G, GV, kd, hSrc, tr_z, h, tr, all_cells=.false., answer_date=remap_answer_date)
    endif
Thank you for tracking down the source of this problem, @favorliao, with the use of the uninitialized hSrc array when h_is_in_Z_units == .true., which only occurs with the generic tracers. I agree with your diagnosis, but not with the solution you propose.
The issue here is that when h_is_in_Z_units == .true., the thickness variable 'h' is being provided in depth units, not thickness units. The subroutine dz_to_thickness_simple() converts vertical extents (in depth units) into thicknesses (in thickness units), but that is not what is needed here. Instead, I think that the solution is to replace call ALE_remap_scalar(..., hSrc, ...) with call ALE_remap_scalar(..., dzSrc, ...) inside the h_is_in_Z_units == .true. block.
I have put in a pull request (#650) that I think should address this problem, and it is passing the usual MOM6 regression tests. However, this particular generic-tracer bug is not detected by our usual tests, so I would appreciate it if you could evaluate whether my proposed bug fix actually addresses the problem, @nikizadehgfdl.
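To illustrate why the source and target columns need to be described in the same units, here is a minimal standalone Python sketch of a piecewise-constant conservative remap. It is not MOM6 code: remap_pcm, RHO0, and the sample columns are hypothetical, and the constant-density conversion is only a stand-in for what a depth-to-thickness conversion might do in a non-Boussinesq configuration.

```python
# Illustrative sketch only; none of these names come from MOM6.
RHO0 = 1035.0  # assumed reference density [kg m-3], used only for the unit-mismatch demo


def remap_pcm(src_dz, src_vals, tgt_dz):
    """Remap piecewise-constant cell values from a source column to a target column.

    Both thickness lists are measured downward from the surface and must be
    expressed in the same unit for the overlaps to be meaningful.
    """
    if sum(src_dz) <= 0.0:
        # Stand-in for an unset source thickness array: there is nothing to remap from.
        raise ValueError("source column has no thickness")
    out = []
    tgt_top = 0.0
    for dz_t in tgt_dz:
        tgt_bot = tgt_top + dz_t
        integral = 0.0
        src_top = 0.0
        for dz_s, val in zip(src_dz, src_vals):
            src_bot = src_top + dz_s
            # Vertical extent shared by this source cell and this target cell.
            overlap = min(tgt_bot, src_bot) - max(tgt_top, src_top)
            if overlap > 0.0:
                integral += val * overlap
            src_top = src_bot
        out.append(integral / dz_t if dz_t > 0.0 else 0.0)
        tgt_top = tgt_bot
    return out


src_dz = [10.0, 10.0]   # source cell extents in depth units [m]
src_vals = [1.0, 3.0]   # tracer value in each source cell
tgt_dz = [5.0, 15.0]    # target cell extents, also in depth units [m]

# Consistent units: both columns are in metres, so the column integral is preserved.
consistent = remap_pcm(src_dz, src_vals, tgt_dz)

# Mismatched units: the source is converted to mass per unit area [kg m-2]
# while the target stays in metres, so only the top of the source is sampled.
src_h = [RHO0 * dz for dz in src_dz]
mismatched = remap_pcm(src_h, src_vals, tgt_dz)

print(consistent)
print(mismatched)
```

In the mismatched case the whole 20 m target column falls inside the first source cell, so the deeper tracer value never reaches the target. In this toy setting, that is the reason to pass the depth-unit source extents when the target is also in depth units.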
It has been verified that this issue was corrected when PR #650 was merged into dev/gfdl.