Unexpected test-suite failures: Subscript #1 of the array FSDS_DIR has value 1 which is greater than the upper bound of 0
slevis-lmwg opened this issue · 7 comments
ERP_D_Ld60.f19_g17.I1850Slim50RsGs.cheyenne_intel.clm-realistic_fromCLM5_1850Monthly
(previously ERP_D_Ld60.f19_g17.I1850Clm45BgcGs.cheyenne_intel.clm-realistic_fromCLM5_1850Monthly)
ERP_Ld60_D.f19_g17.I2000SlimRsGs.cheyenne_intel.clm-2000_CMIP6_AMIP_1deg_ensembleMonthly
In the cesm.log files I find the same error:
1079:MPT: 29 ../sysdeps/unix/sysv/linux/waitpid.c: No such file or directory.
1079:MPT: (gdb) #0 0x00002b51d335c7da in __waitpid (pid=pid@entry=43075,
1079:MPT: stat_loc=stat_loc@entry=0x7ffc14964bc0, options=options@entry=0)
1079:MPT: at ../sysdeps/unix/sysv/linux/waitpid.c:29
1079:MPT: #1 0x00002b51d41a3db6 in mpi_sgi_system (
1079:MPT: #2 MPI_SGI_stacktraceback (
1079:MPT: header=header@entry=0x7ffc14965380 "MPT ERROR: Rank 1079(g:1079) received signal SIGSEGV(11).\n\tProcess ID: 43068, Host: r8i5n19, Program: /glade/scratch/slevis/ERP_D_Ld60.f19_g17.I1850Clm45BgcGs.cheyenne_intel.clm-realistic_fromCLM5_185"...) at sig.c:340
1079:MPT: #3 0x00002b51d41a3fb2 in first_arriver_handler (signo=signo@entry=11,
1079:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b51de800080) at sig.c:489
1079:MPT: #4 0x00002b51d41a434b in slave_sig_handler (signo=11,
1079:MPT: siginfo=<optimized out>, extra=<optimized out>) at sig.c:564
1079:MPT: #5 <signal handler called>
1079:MPT: #6 0x0000000000aecb3d in mml_mainmod::mml_main (bounds=..., atm2lnd_inst=...,
1079:MPT: lnd2atm_inst=...)
1079:MPT: at /glade/work/slevis/git_slim/cesm2_1_slim/src/main/mml_main.F90:482
The atm.log stops here:
(datm_comp_run) atm: model date 10101 84600s
(datm_comp_run) atm: model date 10102 0s
Running the same tests on izumi
ERP_D_Ld60.f19_g17.I1850Slim50RsGs.izumi_intel.clm-realistic_fromCLM5_1850Monthly
(previously ERP_D_Ld60.f19_g17.I1850Clm45BgcGs.izumi_intel.clm-realistic_fromCLM5_1850Monthly)
ERP_Ld60_D.f19_g17.I2000SlimRsGs.izumi_intel.clm-2000_CMIP6_AMIP_1deg_ensembleMonthly
confirms the same line of code and identifies the problematic array:
MML end of 1d restart vars
forrtl: severe (408): fort: (2): Subscript #1 of the array FSDS_DIR has value 1 which is greater than the upper bound of 0
Image PC Routine Line Source
cesm.exe 0000000001FA370F Unknown Unknown Unknown
cesm.exe 0000000000B12615 mml_mainmod_mp_mm 482 mml_main.F90
cesm.exe 0000000000899A42 clm_driver_mp_clm 144 clm_driver.F90
cesm.exe 00000000008688BC lnd_comp_mct_mp_l 476 lnd_comp_mct.F90
cesm.exe 0000000000475ECE component_mod_mp_ 728 component_mod.F90
cesm.exe 00000000004420EF cime_comp_mod_mp_ 2720 cime_comp_mod.F90
cesm.exe 000000000045D005 MAIN__ 125 cime_driver.F90
In #31 I'm seeing this as well, and I've started to add some SHR_ASSERT checking to diagnose what is going on. I'm not sure why this only happens with the ERP tests.
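As a rough illustration of the kind of check meant here, the sketch below compares an array's extent against the gridcell range it is supposed to cover. It is plain Fortran and deliberately does not use the actual SHR_ASSERT macros; the function name extent_matches and the variable fsds_dir are hypothetical stand-ins, and the empty range 1:0 is an assumption chosen to match the reported upper bound of 0.

```fortran
program bounds_check_demo
  implicit none
  real, allocatable :: fsds_dir(:)   ! hypothetical stand-in for FSDS_DIR

  ! Assumption: the array was allocated over an empty gridcell range.
  allocate(fsds_dir(1:0))

  ! Extent agrees with the empty range 1:0, so this prints true.
  print *, extent_matches(fsds_dir, 1, 0)
  ! Code that expects at least gridcell 1 to exist would fail this check,
  ! so this prints false.
  print *, extent_matches(fsds_dir, 1, 1)

contains

  ! Returns .true. when field has exactly as many elements as begg:endg.
  logical function extent_matches(field, begg, endg)
    real,    intent(in) :: field(:)
    integer, intent(in) :: begg, endg
    extent_matches = (size(field) == max(endg - begg + 1, 0))
  end function extent_matches

end program bounds_check_demo
```

A check like this, placed before the failing line, would show whether the array is smaller than the code assumes rather than leaving that to the runtime bounds checker.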
Using one of the above tests, ERP_D_Ld60.f19_g17.I1850Slim50RsGs.cheyenne_intel.clm-realistic_fromCLM5_1850Monthly, I isolated this problem:
Parts of mml_main are not inside do g = begg, endg loops. Accessing data outside this range of g can cause the model to fail.
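The effect can be sketched in a standalone program. This is not the mml_main code: the names fsds_dir, fill_owned, lo, hi, and n are hypothetical, and the setup assumes a task that owns no gridcells (begg = 1, endg = 0), which is one situation consistent with the "upper bound of 0" message. Inside the do g = begg, endg loop nothing is touched on such a task, whereas an unguarded reference to element 1 would be out of bounds.

```fortran
program guarded_gridcell_loop
  implicit none
  integer :: begg, endg, ntrips
  real, allocatable :: fsds_dir(:)   ! hypothetical stand-in for FSDS_DIR

  ! Assumption: this task owns no gridcells, so begg:endg is an empty range.
  begg = 1
  endg = 0
  allocate(fsds_dir(begg:endg))      ! zero-size array

  ! For a zero-size array, ubound is defined to be 0.
  print *, 'size =', size(fsds_dir), ' ubound =', ubound(fsds_dir, 1)

  ! An unguarded statement such as
  !   fsds_dir(1) = 0.0
  ! would index past the upper bound here, so it is left commented out.

  call fill_owned(fsds_dir, begg, endg, ntrips)
  print *, 'loop iterations on empty task =', ntrips

  ! A task that owns gridcells 1..3 is handled by the same loop.
  deallocate(fsds_dir)
  begg = 1
  endg = 3
  allocate(fsds_dir(begg:endg))
  call fill_owned(fsds_dir, begg, endg, ntrips)
  print *, 'loop iterations on task with gridcells =', ntrips

contains

  ! Zeroes only the gridcells in lo:hi and reports how many were touched.
  subroutine fill_owned(field, lo, hi, n)
    integer, intent(in)    :: lo, hi
    real,    intent(inout) :: field(lo:hi)
    integer, intent(out)   :: n
    integer :: g

    n = 0
    do g = lo, hi            ! runs zero times when hi < lo
      field(g) = 0.0
      n = n + 1
    end do
  end subroutine fill_owned

end program guarded_gridcell_loop
```

Compiling with runtime bounds checking enabled (as the _D debug tests do) and uncommenting the unguarded assignment should reproduce the same class of subscript error.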
Adding loops got me past the original error. My test ERP_D_Ld60.f19_g17.I1850Slim50RsGs.cheyenne_intel.clm-realistic_fromCLM5_1850Monthly.20230112_171812_19sok8/run/case2run now fails partway through case2. I repeated the test twice and confirmed that all three times it failed in case2 at timestep 1490 with this traceback in the cesm.log:
55:MPT ERROR: Rank 55(g:55) received signal SIGBUS(7).
55: Process ID: 63761, Host: r12i7n27, Program: /glade/scratch/slevis/ERP_D_Ld60.f19_g17.I1850Slim50RsGs.cheyenne_intel.clm-realistic_fromCLM5_1850Monthly.20230112_174031_gzump0/bld/case2bld/cesm.exe
55: MPT Version: HPE MPT 2.19 02/23/19 05:30:09
55:
55:MPT: --------stack traceback-------
55:MPT: Missing separate debuginfo for /glade/u/apps/ch/os/usr/lib64/libmlx5-rdmav2.so
55:MPT: Try: zypper install -C "debuginfo(build-id)=ba8002518966160a27c335e04ce8932989f69056"
55:MPT: (No debugging symbols found in /glade/u/apps/ch/os/usr/lib64/libmlx5-rdmav2.so)
55:MPT: 0x00002acca4cdc7da in waitpid () from /glade/u/apps/ch/os/lib64/libpthread.so.0
55:MPT: Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-100.27.3.x86_64
55:MPT: (gdb) #0 0x00002acca4cdc7da in waitpid ()
55:MPT: from /glade/u/apps/ch/os/lib64/libpthread.so.0
55:MPT: #1 0x00002acca5b23db6 in mpi_sgi_system (
55:MPT: #2 MPI_SGI_stacktraceback (
55:MPT: header=header@entry=0x7ffc66172100 "MPT ERROR: Rank 55(g:55) received signal SIGBUS(7).\n\tProcess ID: 63761, Host: r12i7n27, Program: /glade/scratch/slevis/ERP_D_Ld60.f19_g17.I1850Slim50RsGs.cheyenne_intel.clm-realistic_fromCLM5_1850Mont"...) at sig.c:340
55:MPT: #3 0x00002acca5b23fb2 in first_arriver_handler (signo=signo@entry=7,
55:MPT: stack_trace_sem=stack_trace_sem@entry=0x2accb0140080) at sig.c:489
55:MPT: #4 0x00002acca5b2434b in slave_sig_handler (signo=7, siginfo=<optimized out>,
55:MPT: extra=<optimized out>) at sig.c:564
55:MPT: #5 <signal handler called>
55:MPT: #6 0x000000000123f2b5 in m_attrvect::L_m_attrvect_mp_rcopy___2601__par_loop1_2_7 ()
55:MPT: at /glade/work/slevis/git_slim/SimpleLand/cime/src/externals/mct/mct/m_AttrVect.F90:2604
55:MPT: #7 0x00002acca43e6d13 in __kmp_invoke_microtask ()
55:MPT: from /glade/u/apps/opt/intel/2017u1/compilers_and_libraries/linux/lib/intel64/libiomp5.so
55:MPT: #8 0x00002acca43b6fad in __kmp_fork_call (loc=0x0, gtid=24,
55:MPT: call_context=(unknown: 4245048688), argc=-49714440,
55:MPT: microtask=0x2b0bfd0b2080, invoker=0x7ffc66172bbc, ap=0x7ffc66173410)
55:MPT: at ../../src/kmp_runtime.c:2003
55:MPT: #9 0x00002acca438f6f8 in __kmpc_fork_call (loc=0x0, argc=24,
55:MPT: microtask=0x2b0bfd064d70) at ../../src/kmp_csupport.c:339
55:MPT: #10 0x000000000123e7f9 in m_attrvect::rcopy_ (avin=..., avout=...,
55:MPT: vector=.FALSE., sharedindices=...)
55:MPT: at /glade/work/slevis/git_slim/SimpleLand/cime/src/externals/mct/mct/m_AttrVect.F90:2601
55:MPT: #11 0x0000000001248a23 in m_attrvect::copy_ (avin=..., avout=..., rlist=...,
55:MPT: trlist=<error reading variable: Cannot access memory at address 0x0>,
55:MPT: ilist=<error reading variable: Cannot access memory at address 0x0>,
55:MPT: tilist=..., vector=.FALSE., sharedindices=..., .tmp.RLIST.len_V$21aa=0,
55:MPT: .tmp.TRLIST.len_V$21ba=0, .tmp.ILIST.len_V$21ca=0,
55:MPT: .tmp.TILIST.len_V$21da=0)
55:MPT: at /glade/work/slevis/git_slim/SimpleLand/cime/src/externals/mct/mct/m_AttrVect.F90:3295
55:MPT: #12 0x00000000005321dc in prep_lnd_mod::prep_lnd_merge (a2x_l=..., r2x_l=...,
55:MPT: g2x_l=..., x2l_l=...)
55:MPT: at /glade/work/slevis/git_slim/SimpleLand/cime/src/drivers/mct/main/prep_lnd_mod.F90:355
55:MPT: #13 0x000000000052f9ea in prep_lnd_mod::prep_lnd_mrg (infodata=...,
55:MPT: timer_mrg=..., .tmp.TIMER_MRG.len_V$2476=18)
55:MPT: at /glade/work/slevis/git_slim/SimpleLand/cime/src/drivers/mct/main/prep_lnd_mod.F90:287
55:MPT: #14 0x000000000042fc69 in cime_comp_mod::cime_run ()
55:MPT: at /glade/work/slevis/git_slim/SimpleLand/cime/src/drivers/mct/main/cime_comp_mod.F90:2528
55:MPT: #15 0x0000000000449b39 in cime_driver ()
55:MPT: at /glade/work/slevis/git_slim/SimpleLand/cime/src/drivers/mct/main/cime_driver.F90:125
55:MPT: #16 0x00000000004084de in main ()
55:MPT: #17 0x00002acca5e10a35 in __libc_start_main ()
55:MPT: from /glade/u/apps/ch/os/lib64/libc.so.6
55:MPT: #18 0x00000000004083e9 in _start () at ../sysdeps/x86_64/start.S:118
55:MPT: (gdb) A debugging session is active.
55:MPT:
55:MPT: Inferior 1 [process 63761] will be detached.
55:MPT:
55:MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
55:MPT: Detaching from program: /proc/63761/exe, process 63761
55:MPT: [Inferior 1 (process 63761) detached]
@ekluzek @fischer-ncar suggest (Stand-up 2023/1/13):
- run a comparison to baseline with current mods
- whether answers change or not, put the do g loop around everything in mml_main
- if answers are changing, generate new baselines and discuss with @marysa
- the new failure in MCT (see previous post) can be tabled for now since we will soon upgrade to NUOPC
Regarding the first item, run a comparison to baseline with current mods:
./create_test ERS_D_Ld60.f19_g16.H_MML_2000_CAM5.cheyenne_gnu.clm-global_uniform_g16_SOM -c /glade/p/cgd/tss/ctsm_baselines/slim-n14_cesm2.1.4
PASS... no diffs from baseline