Porting build from GNU to Intel, runtime error: "Subscript #1 of the array DES has value 0 which is less than the lower bound of 1"
Short issue:
We are getting the runtime error:
forrtl: severe (408): fort: (3): Subscript #1 of the array DES has value 0 which is less than the lower bound of 1
In more detail:
We are using the latest version of shield, i.e. SHiELD_BUILD_VERSION="FV3-202204-public", FV3_VERSION="FV3-202210-public", FMS_VERSION="2022.04". We're running on an ubuntu 22.04 linux AWS ec2 instance, and have built/run SHiELD successfully for many months using OpenMPI/gfortran.
We are now switching our build over from OpenMPI/gfortran (MKMF_TEMPLATE=linux-ubuntu-trusty-gnu.mk) to IntelMPI/ifort (MKMF_TEMPLATE="intel.mk"). We are using intel version:
mpiifort for the Intel(R) MPI Library 2021.10 for Linux*
Copyright Intel Corporation.
ifort version 2021.10.0
Our build is based as closely as possible on this SHiELD_build repo. We're testing a 1-hour C96 simulation with our original OpenMPI/gfortran build, and it completes successfully (~300 seconds on 24 cores). With IntelMPI/ifort, the model builds successfully, but from the same experiment directory where the GNU build runs without error, the intel build gives the following error at runtime:
---------------------------------------------
NOTE from PE 0: READING FROM SST_restart DISABLED
Before adi: W max = 1.573370 min = -1.371867
NOTE from PE 0: Performing adiabatic init 1 times
forrtl: severe (408): fort: (3): Subscript #1 of the array DES has value 0 which is less than the lower bound of 1
Image PC Routine Line Source
shield_nh.prod.32 00000000015BCBBE gfdl_mp_mod_mp_qs 7233 gfdl_mp.F90
shield_nh.prod.32 00000000015BD7D4 gfdl_mp_mod_mp_iq 7369 gfdl_mp.F90
shield_nh.prod.32 00000000015174C5 gfdl_mp_mod_mp_cl 4621 gfdl_mp.F90
shield_nh.prod.32 0000000001428F26 gfdl_mp_mod_mp_mp 1429 gfdl_mp.F90
shield_nh.prod.32 00000000015589F5 gfdl_mp_mod_mp_fa 5648 gfdl_mp.F90
shield_nh.prod.32 00000000018EB123 intermediate_phys 257 intermediate_phys.F90
libiomp5.so 000014B302363493 __kmp_invoke_micr Unknown Unknown
libiomp5.so 000014B3022D1CA4 __kmp_fork_call Unknown Unknown
libiomp5.so 000014B302289D23 __kmpc_fork_call Unknown Unknown
shield_nh.prod.32 00000000018C8D09 intermediate_phys 186 intermediate_phys.F90
shield_nh.prod.32 0000000000BE6BC0 fv_mapz_mod_mp_la 841 fv_mapz.F90
shield_nh.prod.32 00000000019FA0A1 fv_dynamics_mod_m 590 fv_dynamics.F90
shield_nh.prod.32 0000000002D31F23 atmosphere_mod_mp 1553 atmosphere.F90
shield_nh.prod.32 0000000002C61BFA atmosphere_mod_mp 431 atmosphere.F90
shield_nh.prod.32 0000000002280A56 atmos_model_mod_m 395 atmos_model.F90
shield_nh.prod.32 0000000000EDE999 coupler_main_IP_c 417 coupler_main.F90
shield_nh.prod.32 0000000000ED93FF MAIN__ 146 coupler_main.F90
shield_nh.prod.32 000000000041504D Unknown Unknown Unknown
libc.so.6 000014B301E29D90 Unknown Unknown Unknown
libc.so.6 000014B301E29E40 __libc_start_main Unknown Unknown
shield_nh.prod.32 0000000000414F65 Unknown Unknown Unknown
For reference, the traceback points to intermediate_phys.F90, line 257:
https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/blob/d2e5bef344b64d6a10524479b3288717239fb2a2/model/intermediate_phys.F90#L257
! fast saturation adjustment
call fast_sat_adj (abs (mdt), is, ie, kmp, km, hydrostatic, consv .gt. consv_min, &
adj_vmr (is:ie, kmp:km), te (is:ie, j, kmp:km), dte (is:ie), q (is:ie, j, kmp:km, sphum), &
q (is:ie, j, kmp:km, liq_wat), q (is:ie, j, kmp:km, rainwat), &
q (is:ie, j, kmp:km, ice_wat), q (is:ie, j, kmp:km, snowwat), &
q (is:ie, j, kmp:km, graupel), q (is:ie, j, kmp:km, cld_amt), &
q2 (is:ie, kmp:km), q3 (is:ie, kmp:km), hs (is:ie, j), &
dz (is:ie, kmp:km), pt (is:ie, j, kmp:km), delp (is:ie, j, kmp:km), &
#ifdef USE_COND
q_con (is:ie, j, kmp:km), &
#else
q_con (isd:, jsd, 1:), &
#endif
#ifdef MOIST_CAPPA
cappa (is:ie, j, kmp:km), &
#else
cappa (isd:, jsd, 1:), &
#endif
gsize, last_step, inline_mp%cond (is:ie, j), inline_mp%reevap (is:ie, j), &
inline_mp%dep (is:ie, j), inline_mp%sub (is:ie, j), do_sat_adj)
I checked our build logs: both USE_COND and MOIST_CAPPA are defined, activated by the 'nh' (non-hydrostatic) setting.
I noticed this is called from:
https://github.com/NOAA-GFDL/SHiELD_physics/blob/2882fdeb429abc2349a8e881803ac67b154532c3/simple_coupler/coupler_main.F90#L146C19-L146C19
call fms_init()
call mpp_init()
initClock = mpp_clock_id( 'Initialization' )
call mpp_clock_begin (initClock) !nesting problem
call fms_init
call constants_init
call fms_affinity_init
call sat_vapor_pres_init
call coupler_init
As an additional piece of information, we have also written our own control/coupler file, and with it the intel build does not hit this runtime error. In ours we comment out the second fms_init call and fms_affinity_init, since fms_init is called twice here and fms_affinity_init was later removed in https://github.com/NOAA-GFDL/FMScoupler/blob/main/SHiELD/coupler_main.F90:
! call fms_init(mpi_comm_fv3)
if (dodebug) print *, "fv3_shield_cap:: calling constants_init..."
call constants_init
! if (dodebug) print *, "fv3_shield_cap:: calling fms_affinity_init..."
! call fms_affinity_init
I've tried building the IntelMPI/ifort executable both in a docker container and via a bash script directly on the ec2 instance, and in both 'prod' and 'debug' mode, but all combinations give the same error above.
I've tried removing "export FMS_CPPDEFS=-DHAVE_GETTID" from the build options; in that case, building FMS fails.
I found a similar issue report in E3SM caused by an upgrade of the intel compiler. In their case it was a compiler bug, but I'm not sure if that is true here:
E3SM-Project/E3SM#2051
Have you seen this error before, and do you have any idea what might be causing it? I recall getting a similar error in Dec 2022; at the time the FMS version was part of the problem, and upgrading FMS resolved it. However, the FMS versions are the same between the two builds in this case.
Hi, Steve. Thank you for identifying this issue. We occasionally see a crash at this location (in one of the lookup tables in gfdl_mp.F90), and in the past I thought it was due to numerical instability; but if it is not present with the GNU compiler, that strongly suggests something different is going on. The routine is set up so that it can never round to an integer less than 1, no matter how low the temperature is, but with intel it is apparently either rounding down to 0 at temperatures near t_min, or getting a NaN and producing bad results.
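The failure mode described above can be sketched in a few lines (the names and table parameters here are hypothetical, not the actual gfdl_mp.F90 code): a 1-based table index computed by truncation falls to 0 as soon as the temperature dips below the table's lower bound, unless the index is explicitly clamped.

```python
# Hypothetical sketch of a 1-based saturation-table lookup.
# T_MIN, DT, and N are illustrative values, not the gfdl_mp.F90 constants.
T_MIN = 100.0   # assumed lower bound of the table (K)
DT = 0.1        # assumed table spacing (K)
N = 2621        # assumed table length

def table_index_unsafe(t):
    # int() truncates toward zero, so t < T_MIN can yield an index of 0
    # (or lower), reproducing the "less than the lower bound of 1" error.
    return int((t - T_MIN) / DT) + 1

def table_index_safe(t):
    # Clamp into 1..N so the subscript can never leave the table.
    return min(N, max(1, int((t - T_MIN) / DT) + 1))

print(table_index_unsafe(99.85))  # 0: out of bounds for a 1-based array
print(table_index_safe(99.85))    # 1: clamped to the lower bound
```

The clamped version never faults, which is presumably the intent of the guard Lucas mentions; the question is why the intel build ends up below the bound at all.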
Since it is crashing shortly after initialization, it may be worthwhile to turn on range_warn and fv_debug to help pinpoint at what step it is crashing. Could you try that and send along the output log?
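For reference, a minimal way to enable both flags (assuming the usual SHiELD/FV3 convention of setting them in the fv_core_nml namelist in input.nml):

```fortran
&fv_core_nml
    range_warn = .true.   ! warn when fields leave expected physical ranges
    fv_debug   = .true.   ! print per-step min/max diagnostics
/
```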
Thanks,
Lucas
Hi Lucas (@lharris4 ), I've run the intel build and a 10-minute integration of the gnu build, both with range_warn and fv_debug set to 'true'. I'm attaching the output from both.
Hi, Steve. I see immediately that the gfortran run was compiled double-precision but the intel run is single-precision. I would think that by itself this would not cause a crash, but it could be a signal of some other underlying issue. The debug output doesn't suggest anything suspicious.
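This is not necessarily what happens inside gfdl_mp, but as a minimal illustration of how the precision mismatch alone can change a truncated table index: the same expression can truncate to different integers in single and double precision. Here single-precision arithmetic is emulated in Python via struct (the f32 helper is illustrative):

```python
import struct

def f32(x):
    # Round a Python float (double) to the nearest IEEE single-precision value.
    return struct.unpack('f', struct.pack('f', x))[0]

t, dt = 0.29, 0.01

# Double precision: 0.29/0.01 evaluates just below 29, so truncation gives 28.
idx_double = int(t / dt)

# Emulated single precision: operands and the quotient are each rounded to
# float32; the quotient lands exactly on 29.0, so truncation gives 29.
idx_single = int(f32(f32(t) / f32(dt)))

print(idx_double, idx_single)  # 28 29
```

An off-by-one like this is harmless in the middle of a table but fatal at its lower bound, which is consistent with the crash appearing only in the single-precision build.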