Porting build from GNU to Intel, runtime error: "Subscript #1 of the array DES has value 0 which is less than the lower bound of 1"
Short issue:
We are getting the runtime error:
forrtl: severe (408): fort: (3): Subscript #1 of the array DES has value 0 which is less than the lower bound of 1
In more detail:
We are using the latest version of shield, i.e. SHiELD_BUILD_VERSION="FV3-202204-public", FV3_VERSION="FV3-202210-public", FMS_VERSION="2022.04". We're running on an ubuntu 22.04 linux AWS ec2 instance, and have built/run SHiELD successfully for many months using OpenMPI/gfortran.
We are now switching our build over from OpenMPI/gfortran (MKMF_TEMPLATE=linux-ubuntu-trusty-gnu.mk) to IntelMPI/ifort (MKMF_TEMPLATE="intel.mk"). We are using intel version:
mpiifort for the Intel(R) MPI Library 2021.10 for Linux*
Copyright Intel Corporation.
ifort version 2021.10.0
Our build is based as closely as possible on this SHiELD_build repo. We're testing a 1-hour C96 simulation with our original OpenMPI/gfortran build, and it completes successfully (~300 seconds on 24 cores). With IntelMPI/ifort, the model builds successfully, but from the same experiment directory where the GNU build runs without error, the intel build gives the following error at runtime:
---------------------------------------------
NOTE from PE 0: READING FROM SST_restart DISABLED
Before adi: W max = 1.573370 min = -1.371867
NOTE from PE 0: Performing adiabatic init 1 times
forrtl: severe (408): fort: (3): Subscript #1 of the array DES has value 0 which is less than the lower bound of 1
Image PC Routine Line Source
shield_nh.prod.32 00000000015BCBBE gfdl_mp_mod_mp_qs 7233 gfdl_mp.F90
shield_nh.prod.32 00000000015BD7D4 gfdl_mp_mod_mp_iq 7369 gfdl_mp.F90
shield_nh.prod.32 00000000015174C5 gfdl_mp_mod_mp_cl 4621 gfdl_mp.F90
shield_nh.prod.32 0000000001428F26 gfdl_mp_mod_mp_mp 1429 gfdl_mp.F90
shield_nh.prod.32 00000000015589F5 gfdl_mp_mod_mp_fa 5648 gfdl_mp.F90
shield_nh.prod.32 00000000018EB123 intermediate_phys 257 intermediate_phys.F90
libiomp5.so 000014B302363493 __kmp_invoke_micr Unknown Unknown
libiomp5.so 000014B3022D1CA4 __kmp_fork_call Unknown Unknown
libiomp5.so 000014B302289D23 __kmpc_fork_call Unknown Unknown
shield_nh.prod.32 00000000018C8D09 intermediate_phys 186 intermediate_phys.F90
shield_nh.prod.32 0000000000BE6BC0 fv_mapz_mod_mp_la 841 fv_mapz.F90
shield_nh.prod.32 00000000019FA0A1 fv_dynamics_mod_m 590 fv_dynamics.F90
shield_nh.prod.32 0000000002D31F23 atmosphere_mod_mp 1553 atmosphere.F90
shield_nh.prod.32 0000000002C61BFA atmosphere_mod_mp 431 atmosphere.F90
shield_nh.prod.32 0000000002280A56 atmos_model_mod_m 395 atmos_model.F90
shield_nh.prod.32 0000000000EDE999 coupler_main_IP_c 417 coupler_main.F90
shield_nh.prod.32 0000000000ED93FF MAIN__ 146 coupler_main.F90
shield_nh.prod.32 000000000041504D Unknown Unknown Unknown
libc.so.6 000014B301E29D90 Unknown Unknown Unknown
libc.so.6 000014B301E29E40 __libc_start_main Unknown Unknown
shield_nh.prod.32 0000000000414F65 Unknown Unknown Unknown
For reference, the traceback points to intermediate_phys.F90, line 257:
https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/blob/d2e5bef344b64d6a10524479b3288717239fb2a2/model/intermediate_phys.F90#L257
! fast saturation adjustment
call fast_sat_adj (abs (mdt), is, ie, kmp, km, hydrostatic, consv .gt. consv_min, &
adj_vmr (is:ie, kmp:km), te (is:ie, j, kmp:km), dte (is:ie), q (is:ie, j, kmp:km, sphum), &
q (is:ie, j, kmp:km, liq_wat), q (is:ie, j, kmp:km, rainwat), &
q (is:ie, j, kmp:km, ice_wat), q (is:ie, j, kmp:km, snowwat), &
q (is:ie, j, kmp:km, graupel), q (is:ie, j, kmp:km, cld_amt), &
q2 (is:ie, kmp:km), q3 (is:ie, kmp:km), hs (is:ie, j), &
dz (is:ie, kmp:km), pt (is:ie, j, kmp:km), delp (is:ie, j, kmp:km), &
#ifdef USE_COND
q_con (is:ie, j, kmp:km), &
#else
q_con (isd:, jsd, 1:), &
#endif
#ifdef MOIST_CAPPA
cappa (is:ie, j, kmp:km), &
#else
cappa (isd:, jsd, 1:), &
#endif
gsize, last_step, inline_mp%cond (is:ie, j), inline_mp%reevap (is:ie, j), &
inline_mp%dep (is:ie, j), inline_mp%sub (is:ie, j), do_sat_adj)
I checked our build logs: both USE_COND and MOIST_CAPPA are defined, activated by the 'nh' (non-hydrostatic) setting.
I noticed this is called from:
https://github.com/NOAA-GFDL/SHiELD_physics/blob/2882fdeb429abc2349a8e881803ac67b154532c3/simple_coupler/coupler_main.F90#L146C19-L146C19
call fms_init()
call mpp_init()
initClock = mpp_clock_id( 'Initialization' )
call mpp_clock_begin (initClock) !nesting problem
call fms_init
call constants_init
call fms_affinity_init
call sat_vapor_pres_init
call coupler_init
As an additional piece of information, we have also written our own control/coupler file, and with it the intel build does not hit this runtime error. In ours we comment out the second fms_init call and fms_affinity_init, since fms_init is called twice here and fms_affinity_init was later removed in https://github.com/NOAA-GFDL/FMScoupler/blob/main/SHiELD/coupler_main.F90:
! call fms_init(mpi_comm_fv3)
if (dodebug) print *, "fv3_shield_cap:: calling constants_init..."
call constants_init
! if (dodebug) print *, "fv3_shield_cap:: calling fms_affinity_init..."
! call fms_affinity_init
I've tried building the IntelMPI/ifort executable both in a docker container and via a bash script directly on the ec2 instance, and in both 'prod' and 'debug' mode, but all combinations give the same error above.
I've tried removing "export FMS_CPPDEFS=-DHAVE_GETTID" from the build options; in that case, building FMS fails.
I found a similar issue report in E3SM caused by an upgrade of the intel compiler. In their case it was a compiler bug, but I'm not sure if that is true here:
E3SM-Project/E3SM#2051
Have you seen this error before, and do you have any idea what might be causing it? I recall getting a similar error in Dec 2022; at the time the FMS version was part of the problem, and upgrading FMS resolved it. However, the FMS versions are the same between the two builds in this case.
Hi, Steve. Thank you for identifying this issue. We occasionally see a crash at this location (in one of the lookup tables in gfdl_mp.F90), and in the past I thought it was due to numerical instability; but if it is not present with the GNU compiler, that strongly suggests something different is going on. The routine is set up so that it can never round to an integer less than 1, no matter how low the temperature is, but with intel it is apparently either rounding down to 0 at temperatures near t_min, or getting a NaN and producing bad results.
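The failure mode described above can be sketched in a few lines (the names and table parameters here are hypothetical, not the actual gfdl_mp.F90 code): a 1-based table index computed by truncation falls to 0 as soon as the temperature dips below the table's lower bound, unless the index is explicitly clamped.

```python
# Hypothetical sketch of a 1-based saturation-table lookup.
# T_MIN, DT, and N are illustrative values, not the gfdl_mp.F90 constants.
T_MIN = 100.0   # assumed lower bound of the table (K)
DT = 0.1        # assumed table spacing (K)
N = 2621        # assumed table length

def table_index_unsafe(t):
    # int() truncates toward zero, so t < T_MIN can yield an index of 0
    # (or lower), reproducing the "less than the lower bound of 1" error.
    return int((t - T_MIN) / DT) + 1

def table_index_safe(t):
    # Clamp into 1..N so the subscript can never leave the table.
    return min(N, max(1, int((t - T_MIN) / DT) + 1))

print(table_index_unsafe(99.85))  # 0: out of bounds for a 1-based array
print(table_index_safe(99.85))    # 1: clamped to the lower bound
```

The clamped version never faults, which is presumably the intent of the guard Lucas mentions; the question is why the intel build ends up below the bound at all.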
Since it is crashing shortly after initialization, it may be worthwhile to turn on range_warn and fv_debug to help pinpoint at what step it is crashing. Could you try that and send along the output log?
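For reference, a minimal way to enable both flags (assuming the usual SHiELD/FV3 convention of setting them in the fv_core_nml namelist in input.nml):

```fortran
&fv_core_nml
    range_warn = .true.   ! warn when fields leave expected physical ranges
    fv_debug   = .true.   ! print per-step min/max diagnostics
/
```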
Thanks,
Lucas
Hi Lucas (@lharris4 ), I've run the intel build and a 10-minute integration of the gnu build, both with range_warn and fv_debug set to 'true'. I'm attaching the output from both.
Hi, Steve. I see immediately that the gfortran run was compiled double-precision but the intel run is single-precision. I would think that by itself this would not cause a crash, but it could be a signal of some other underlying issue. The debug output doesn't suggest anything suspicious.
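This is not necessarily what happens inside gfdl_mp, but as a minimal illustration of how the precision mismatch alone can change a truncated table index: the same expression can truncate to different integers in single and double precision. Here single-precision arithmetic is emulated in Python via struct (the f32 helper is illustrative):

```python
import struct

def f32(x):
    # Round a Python float (double) to the nearest IEEE single-precision value.
    return struct.unpack('f', struct.pack('f', x))[0]

t, dt = 0.29, 0.01

# Double precision: 0.29/0.01 evaluates just below 29, so truncation gives 28.
idx_double = int(t / dt)

# Emulated single precision: operands and the quotient are each rounded to
# float32; the quotient lands exactly on 29.0, so truncation gives 29.
idx_single = int(f32(f32(t) / f32(dt)))

print(idx_double, idx_single)  # 28 29
```

An off-by-one like this is harmless in the middle of a table but fatal at its lower bound, which is consistent with the crash appearing only in the single-precision build.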