202204 release - exited on signal 11 (Segmentation fault)
Running on Ubuntu Linux (GNU) in a Docker container:
I'm still working on getting the 202204 release to run successfully (i.e. to at least roughly replicate the pre-202204 version). I'm currently getting the segmentation fault shown below. A previous segmentation fault was fixed by building the FMS package from the 'main' branch of the FMS repo. I've symlinked aerosol.txt, solarconstant_noaa_an.txt, co2historicaldata_*.txt and a few other key files into the INPUT/ directory (they were previously in the main experiment directory):
Updating solar constant with cycle approx
Opened solar constant data file: INPUT/solarconstant_noaa_an.txt
CHECK: Solar constant data used for year 2020 1361.0400000000000 1361.0400000000000
0 FORECAST DATE 26 AUG. 2020 AT 12 HRS 0.00 MINS
JULIAN DAY 2459088 PLUS 0.000000
RADIUS VECTOR 1.0104738
RIGHT ASCENSION OF SUN 10.3754267 HRS, OR 10 HRS 22 MINS 31.5 SECS
DECLINATION OF THE SUN 10.1408708 DEGS, OR 10 DEGS 8 MINS 27.1 SECS
EQUATION OF TIME -1.7063098 MINS, OR -102.38 SECS, OR-0.007466 RADIANS
SOLAR CONSTANT 1332.9711572 (DISTANCE AJUSTED)
for cosz calculations: nswr,deltim,deltsw,dtswh = 8 450.00000000000000 3600.0000000000000 1.0000000000000000 anginc,nstp = 3.2724923474893676E-002 9
Opened aerosol data file: INPUT/aerosol.dat
--- Reading MONTH OF AUGUST CLIMATOLOGICAL AEROSOL GLOBAL DISTRIBUTION
Request volcanic date out of range, optical depth set to lowest value
CHECK: Sample Volcanic data used for month, year: 8 2020
1 1 1 1
Opened co2 data file: INPUT/co2historicaldata_2020.txt
2020 MONTHLY CO2 (PPMV) 24 12 LON/LAT (N-S/0-360E) IN 15 DEGREE RESOLUTION, GLB ANNUAL MEAN = 412.81000000000000 GROWTH RATE = 2.5200000000000000
Global annual mean CO2 data for year 2020 4.1281000000000000E-004
CHECK: Sample of selected months of CO2 data used for year: 2020
Month = 1
4.1894999999999996E-004 4.1873000000000002E-004 4.1708999999999995E-004 4.1537999999999997E-004 4.1341000000000001E-004 4.1173000000000002E-004 4.1005000000000002E-004 4.0923000000000001E-004 4.0920999999999997E-004 4.0912999999999995E-004 4.0892000000000001E-004 4.0863000000000000E-004
Month = 4
4.2148000000000001E-004 4.1961000000000000E-004 4.1841000000000003E-004 4.1831999999999997E-004 4.1779000000000002E-004 4.1539999999999996E-004 4.1255999999999997E-004 4.1018000000000001E-004 4.1001999999999998E-004 4.0969999999999998E-004 4.0936999999999999E-004 4.0924000000000001E-004
Month = 7
4.0852999999999994E-004 4.0848000000000002E-004 4.0861000000000001E-004 4.0970999999999998E-004 4.1144000000000000E-004 4.1177999999999994E-004 4.1160999999999997E-004 4.1099999999999996E-004 4.1077999999999997E-004 4.1047000000000002E-004 4.1013999999999997E-004 4.1000999999999999E-004
Month = 10
4.1172000000000002E-004 4.1114999999999994E-004 4.1237999999999995E-004 4.1209999999999999E-004 4.1077999999999997E-004 4.1110000000000002E-004 4.1175999999999995E-004 4.1212999999999997E-004 4.1164999999999995E-004 4.1120999999999996E-004 4.1104999999999999E-004 4.1089999999999996E-004
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node e90980d4b77e exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
For verification, I've also tried running the regional_Laura test case, and get a similar error:
Updating solar constant with cycle approx
Opened solar constant data file: INPUT/solarconstant_noaa_an.txt
CHECK: Solar constant data used for year 2020 1361.0400000000000 1361.0400000000000
0 FORECAST DATE 26 AUG. 2020 AT 12 HRS 0.00 MINS
JULIAN DAY 2459088 PLUS 0.000000
RADIUS VECTOR 1.0104738
RIGHT ASCENSION OF SUN 10.3754267 HRS, OR 10 HRS 22 MINS 31.5 SECS
DECLINATION OF THE SUN 10.1408708 DEGS, OR 10 DEGS 8 MINS 27.1 SECS
EQUATION OF TIME -1.7063098 MINS, OR -102.38 SECS, OR-0.007466 RADIANS
SOLAR CONSTANT 1332.9711572 (DISTANCE AJUSTED)
for cosz calculations: nswr,deltim,deltsw,dtswh = 8 450.00000000000000 3600.0000000000000 1.0000000000000000 anginc,nstp = 3.2724923474893676E-002 9
Opened aerosol data file: INPUT/aerosol.dat
--- Reading MONTH OF AUGUST CLIMATOLOGICAL AEROSOL GLOBAL DISTRIBUTION
Request volcanic date out of range, optical depth set to lowest value
CHECK: Sample Volcanic data used for month, year: 8 2020
1 1 1 1
Opened co2 data file: INPUT/co2historicaldata_2020.txt
2020 MONTHLY CO2 (PPMV) 24 12 LON/LAT (N-S/0-360E) IN 15 DEGREE RESOLUTION, GLB ANNUAL MEAN = 412.81000000000000 GROWTH RATE = 2.5200000000000000
Global annual mean CO2 data for year 2020 4.1281000000000000E-004
CHECK: Sample of selected months of CO2 data used for year: 2020
Month = 1
4.1894999999999996E-004 4.1873000000000002E-004 4.1708999999999995E-004 4.1537999999999997E-004 4.1341000000000001E-004 4.1173000000000002E-004 4.1005000000000002E-004 4.0923000000000001E-004 4.0920999999999997E-004 4.0912999999999995E-004 4.0892000000000001E-004 4.0863000000000000E-004
Month = 4
4.2148000000000001E-004 4.1961000000000000E-004 4.1841000000000003E-004 4.1831999999999997E-004 4.1779000000000002E-004 4.1539999999999996E-004 4.1255999999999997E-004 4.1018000000000001E-004 4.1001999999999998E-004 4.0969999999999998E-004 4.0936999999999999E-004 4.0924000000000001E-004
Month = 7
4.0852999999999994E-004 4.0848000000000002E-004 4.0861000000000001E-004 4.0970999999999998E-004 4.1144000000000000E-004 4.1177999999999994E-004 4.1160999999999997E-004 4.1099999999999996E-004 4.1077999999999997E-004 4.1047000000000002E-004 4.1013999999999997E-004 4.1000999999999999E-004
Month = 10
4.1172000000000002E-004 4.1114999999999994E-004 4.1237999999999995E-004 4.1209999999999999E-004 4.1077999999999997E-004 4.1110000000000002E-004 4.1175999999999995E-004 4.1212999999999997E-004 4.1164999999999995E-004 4.1120999999999996E-004 4.1104999999999999E-004 4.1089999999999996E-004
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node d312f888f66b exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
@StevePny What test are you running that results in this seg fault failure? Is it one of the tests in the CI directory?
@StevePny To expand upon my previous comment: Is it one of the tests in the SHiELD_build repository CI directory?
@laurenchilutti
The latter case is the regional_Laura case distributed with the SHiELD-in-a-box:
https://www.gfdl.noaa.gov/shield/shield-in-a-box/
https://zenodo.org/record/5090124/files/regional_Laura.zip
I was able to install SHiELD and run these example cases (regional_Laura and global_nest_Laura) prior to the 202204 release.
@StevePny
I have tested the latest SHiELD code with the regional Laura case. When I built SHiELD natively on a NOAA HPC system, the Laura case works fine. However, it does not work with the containerized SHiELD, which is very strange.
It looks like a call into the NCEP library is causing the crash. The segmentation fault occurs at SHiELD_physics/gsmphys/sfcsub.F, line 2757 (commit 8c46d4f).
However, I still don't understand why. Before this line, a call to another NCEP library routine, getgbh(), works just fine. Also, the same compiler flags and arguments worked previously.
@kaiyuan-cheng just checking in - has any progress been made on clearing up this issue, or should we continue with the pre-202204 version?
@StevePny It turns out that the default stack size, 8 MB, is too small to hold the large one-dimensional array lbms. The solution is to set an unlimited stack size.
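For anyone who wants to confirm the limit before changing it, here is a minimal sketch that reports the soft stack limit of the current shell. The function name stack_limit_report is made up for illustration and is not part of SHiELD; the only assumption is that the shell's `ulimit -s` prints either a number of KiB or the word `unlimited`, as bash and dash do.

```shell
#!/bin/sh
# Hypothetical helper (not part of SHiELD): report the soft stack limit.
stack_limit_report() {
    # 'ulimit -s' prints the soft stack limit in KiB, or the word "unlimited".
    stack_kib=$(ulimit -s)
    if [ "$stack_kib" = "unlimited" ]; then
        echo "stack limit: unlimited"
    else
        # Integer division, so values below 1024 KiB are reported as 0 MiB.
        echo "stack limit: ${stack_kib} KiB ($((stack_kib / 1024)) MiB)"
    fi
}

stack_limit_report
```

Running this in the same shell that launches mpirun (inside the container, if the model runs there) shows the limit the model processes will inherit.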
To provide a clarifying detail -
A Docker container does not inherit the host's stack limit by default. The limit can be set on the command line when starting the container, but 'unlimited' is not an accepted value. To request an unlimited stack size in the container, add this option to the docker run command:
--ulimit stack=-1
With this setting I can run the regional_Laura_test case on an AWS c6g.8xlarge EC2 instance.
Note - to be safe, I also set the stack size on the EC2 instance itself with:
ulimit -s unlimited
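Putting the two settings together, a dry-run sketch of the container launch might look like the following. The image name and run script are placeholders that do not appear in this thread; only the `--ulimit stack=-1` option and `ulimit -s` come from the comments above. The script prints the command rather than running it, so it can be checked first.

```shell
#!/bin/sh
# Placeholder names, assumed for illustration only; substitute your own.
IMAGE="shield-container:latest"
RUN_CMD="./run_regional_Laura.sh"

# --ulimit stack=-1 is the option quoted above. The inner 'ulimit -s' prints
# the limit the container actually received before the run script starts.
DOCKER_CMD="docker run --rm --ulimit stack=-1 $IMAGE sh -c 'ulimit -s; $RUN_CMD'"

# Dry run: print the command for inspection.
# To execute it instead, use: eval "$DOCKER_CMD"
echo "$DOCKER_CMD"
```

Printing the limit inside the container first makes it easy to tell whether a later crash is still related to the stack size.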