NOAA-GFDL/MOM6

OMIPp25+BLING and CM4 crash with Overflow in reproducing_EFP_sum(_2d)


After updating MOM6-examples from commit 40e3937 (on 20231130) to commit ab0c120 (on 20240321), the regression test experiment OMIP_CORE2 (which has BLING on) crashes as follows:

FATAL from PE   125: Overflow in reproducing_EFP_sum(_2d) conversion of   9.56361E+43                                  
                                                                                                                       
Image              PC                Routine            Line        Source                                              
fms_MOM6_SIS2_com  0000000001F4A9A7  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc                                    
fms_MOM6_SIS2_com  000000000097B4C8  mom_error_handler         191  MOM_error_handler.F90                              
fms_MOM6_SIS2_com  00000000009E8921  mom_coms_mp_repro         203  MOM_coms.F90                                        
fms_MOM6_SIS2_com  0000000000C66F23  mom_spatial_means         391  MOM_spatial_means.F90                              
fms_MOM6_SIS2_com  0000000000B1F851  mom_generic_trace         689  MOM_generic_tracer.F90                              
fms_MOM6_SIS2_com  00000000009EF10C  mom_tracer_flow_c         725  MOM_tracer_flow_control.F90                        
fms_MOM6_SIS2_com  000000000104B38B  mom_sum_output_mp         530  MOM_sum_output.F90                                  
fms_MOM6_SIS2_com  0000000000BACBF0  mom_mp_finish_mom        3431  MOM.F90                                            
fms_MOM6_SIS2_com  00000000009D057E  ocean_model_mod_m         572  ocean_model_MOM.F90                                
fms_MOM6_SIS2_com  000000000041499F  MAIN__                   1063  coupler_main.F90  

For some layouts, it crashes like:

Nan!
fms_MOM6_SIS2_com  0000000001F3D5FA  mpp_mod_mp_mpp_mi          32  mpp_reduce_mpi.fh                                   
fms_MOM6_SIS2_com  0000000000CE0EC9  mom_horizontal_re          86  MOM_horizontal_regridding.F90                       
fms_MOM6_SIS2_com  00000000010461D5  mom_tracer_initia         220  MOM_tracer_initialization_from_Z.F90                
fms_MOM6_SIS2_com  0000000000B27EA5  mom_generic_trace         354  MOM_generic_tracer.F90                              
fms_MOM6_SIS2_com  00000000009F09E0  mom_tracer_flow_c         343  MOM_tracer_flow_control.F90                         
fms_MOM6_SIS2_com  0000000000BBC705  mom_mp_initialize        3323  MOM.F90  

which comes from the "stop" statement in
https://github.com/NOAA-GFDL/MOM6/blob/dev/gfdl/src/framework/MOM_horizontal_regridding.F90#L74

Running in debug mode (-O0) gives a floating divide by zero with the following traceback:

forrtl: error (73): floating divide by zero                                                                             
Image              PC                Routine            Line        Source                                              
libpthread-2.31.s  000014F967871910  Unknown               Unknown  Unknown                                             
fms_MOM6_SIS2_com  00000000019A5FDA  mom_remapping_mp_         754  MOM_remapping.F90                                   
fms_MOM6_SIS2_com  000000000198F533  mom_remapping_mp_         195  MOM_remapping.F90                                   
fms_MOM6_SIS2_com  0000000003423DDC  mom_ale_mp_ale_re        1335  MOM_ALE.F90                                         
fms_MOM6_SIS2_com  0000000001CCB578  mom_tracer_initia         204  MOM_tracer_initialization_from_Z.F90                
fms_MOM6_SIS2_com  0000000001B3728D  mom_generic_trace         354  MOM_generic_tracer.F90                              
fms_MOM6_SIS2_com  00000000020D3730  mom_tracer_flow_c         343  MOM_tracer_flow_control.F90                         
fms_MOM6_SIS2_com  0000000002016C03  mom_mp_initialize        3323  MOM.F90                                             
fms_MOM6_SIS2_com  0000000001A265B7  ocean_model_mod_m         278  ocean_model_MOM.F90      

The experiment runs fine when I turn off generic tracer BLING.

The OM4p25+BLING crash seems to happen after applying the following MOM6 commit (around February 1st 2023):
9a6ddee

This makes sense, since the crash happens only when BLING is turned on.

The crash is absent in the previous commit, e7a7a82.

I think the cause is that hSrc is never computed from dzSrc before it is passed to the remapping. The source of the issue is here:

dz_neglect = set_dz_neglect(GV, US, remap_answer_date, dz_neglect_edge)

A possible solution would be:

    if (h_is_in_Z_units) then
      dz_neglect = set_dz_neglect(GV, US, remap_answer_date, dz_neglect_edge)
      ! Added: compute hSrc from dzSrc before remapping
      GV_loc = GV ; GV_loc%ke = kd
      call dz_to_thickness_simple(dzSrc, hSrc, G, GV_loc, US)
      ! End of added lines
      call ALE_remap_scalar(remapCS, G, GV, kd, hSrc, tr_z, h, tr, all_cells=.false., answer_date=remap_answer_date, &
                            H_neglect=dz_neglect, H_neglect_edge=dz_neglect_edge)
    else
      ! Equation of state data is not available, so a simpler rescaling will have to suffice,
      ! but it might be problematic in non-Boussinesq mode.
      GV_loc = GV ; GV_loc%ke = kd
      call dz_to_thickness_simple(dzSrc, hSrc, G, GV_loc, US)
      call ALE_remap_scalar(remapCS, G, GV, kd, hSrc, tr_z, h, tr, all_cells=.false., answer_date=remap_answer_date )
    endif

Thank you, @favorliao, for tracking down the source of this problem: the uninitialized hSrc array is used when h_is_in_Z_units == .true., which only occurs with the generic tracers. I agree with your diagnosis, but not with the solution you propose.

The issue here is that when h_is_in_Z_units == .true., the thickness variable 'h' is being provided in depth units, not thickness units. The subroutine dz_to_thickness_simple() converts vertical extents (in depth units) into thicknesses (in thickness units), but that is not what is needed here. Instead, I think that the solution is to replace call ALE_remap_scalar(..., hSrc, ...) with call ALE_remap_scalar(..., dzSrc, ...) inside of the h_is_in_Z_units == .true. block. I have put in a pull request (#650) that I think should address this problem, and it is passing the usual MOM6 regression tests. However, this particular generic-tracer-related bug is obviously not detected by our usual tests, so I would appreciate it if you could evaluate whether my proposed bug fix actually addresses this problem, @nikizadehgfdl.
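For readers following along, here is a minimal, self-contained sketch of why the choice of source-weight array matters. It is a toy and not MOM6 code: column_mean is a hypothetical stand-in for the remapping, and hSrc is zero-filled only to make the run deterministic (a truly unset array holds arbitrary values). The names dzSrc, hSrc, tr_z and kd are borrowed from the snippet in this thread purely for readability.

```fortran
program source_extent_demo
  ! Toy illustration only: column_mean is a hypothetical stand-in,
  ! not MOM6's ALE_remap_scalar or any routine in MOM_remapping.F90.
  use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
  implicit none

  integer, parameter :: kd = 3   ! number of source layers
  real :: dzSrc(kd)              ! source layer vertical extents [m], filled below
  real :: hSrc(kd)               ! source thicknesses, deliberately given no real data
  real :: tr_z(kd)               ! tracer values on the source layers
  real :: tr_from_dz, tr_from_h

  dzSrc = [10.0, 20.0, 30.0]
  tr_z  = [1.0, 2.0, 3.0]
  hSrc  = 0.0   ! assumption: zeros stand in for the contents of an unset array

  tr_from_dz = column_mean(dzSrc, tr_z)   ! (10*1 + 20*2 + 30*3) / 60
  tr_from_h  = column_mean(hSrc, tr_z)    ! 0/0 when floating-point traps are off

  print *, 'mean using dzSrc as weights:', tr_from_dz
  print *, 'mean using hSrc as weights: ', tr_from_h
  print *, 'hSrc result is NaN:         ', ieee_is_nan(tr_from_h)

  ! Self-check for the well-defined case
  if (abs(tr_from_dz - 140.0/60.0) > 1.0e-5) error stop 'unexpected mean from dzSrc'

contains

  !> Weighted mean of a tracer over one column, with no guard on the denominator
  function column_mean(h, tr) result(mean)
    real, intent(in) :: h(:)   !< layer weights (extents or thicknesses)
    real, intent(in) :: tr(:)  !< layer tracer values
    real :: mean
    mean = sum(h * tr) / sum(h)
  end function column_mean

end program source_extent_demo
```

Built with default compiler flags, the second mean evaluates 0/0. With floating-point exception trapping enabled, as in many debug builds, the same division would abort the run instead. Passing the array that actually holds the source extents avoids both outcomes.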

It has been verified that this issue was corrected when PR #650 was merged into dev/gfdl.