NOAA-GFDL/SHiELD_physics

202204 release - exited on signal 11 (Segmentation fault)

Closed this issue · 11 comments

Running on Linux (Ubuntu, GNU) in a Docker container:

I'm still working on getting the 202204 release to run successfully (i.e. to at least roughly replicate the pre-202204 version), and I'm currently hitting the segmentation fault shown below. An earlier segmentation fault was fixed by rebuilding the FMS package from the 'main' branch of the FMS repo. I've also symlinked aerosol.txt, solarconstant_noaa_an.txt, co2historicaldata_*.txt and a few other key files into the INPUT/ directory (from their previous location in the main experiment directory):

  Updating solar constant with cycle approx
    Opened solar constant data file: INPUT/solarconstant_noaa_an.txt 
  CHECK: Solar constant data used for year        2020   1361.0400000000000        1361.0400000000000     
0 FORECAST DATE          26 AUG.  2020 AT 12 HRS  0.00 MINS
  JULIAN DAY             2459088  PLUS   0.000000
  RADIUS VECTOR          1.0104738
  RIGHT ASCENSION OF SUN  10.3754267 HRS, OR  10 HRS  22 MINS  31.5 SECS
  DECLINATION OF THE SUN  10.1408708 DEGS, OR   10 DEGS   8 MINS  27.1 SECS
  EQUATION OF TIME        -1.7063098 MINS, OR   -102.38 SECS, OR-0.007466 RADIANS
  SOLAR CONSTANT        1332.9711572 (DISTANCE AJUSTED)


    for cosz calculations: nswr,deltim,deltsw,dtswh =           8   450.00000000000000        3600.0000000000000        1.0000000000000000        anginc,nstp =   3.2724923474893676E-002           9
    Opened aerosol data file: INPUT/aerosol.dat               
   --- Reading  MONTH OF AUGUST    CLIMATOLOGICAL AEROSOL GLOBAL DISTRIBUTION                  
    Request volcanic date out of range, optical depth set to lowest value
  CHECK: Sample Volcanic data used for month, year:           8        2020
           1           1           1           1
    Opened co2 data file: INPUT/co2historicaldata_2020.txt
        2020  MONTHLY CO2 (PPMV)   24  12  LON/LAT (N-S/0-360E) IN 15 DEGREE RESOLUTION,  GLB ANNUAL MEAN =   412.81000000000000        GROWTH RATE =   2.5200000000000000     
    Global annual mean CO2 data for year        2020   4.1281000000000000E-004
  CHECK: Sample of selected months of CO2 data used for year:        2020
         Month =           1
   4.1894999999999996E-004   4.1873000000000002E-004   4.1708999999999995E-004   4.1537999999999997E-004   4.1341000000000001E-004   4.1173000000000002E-004   4.1005000000000002E-004   4.0923000000000001E-004   4.0920999999999997E-004   4.0912999999999995E-004   4.0892000000000001E-004   4.0863000000000000E-004
         Month =           4
   4.2148000000000001E-004   4.1961000000000000E-004   4.1841000000000003E-004   4.1831999999999997E-004   4.1779000000000002E-004   4.1539999999999996E-004   4.1255999999999997E-004   4.1018000000000001E-004   4.1001999999999998E-004   4.0969999999999998E-004   4.0936999999999999E-004   4.0924000000000001E-004
         Month =           7
   4.0852999999999994E-004   4.0848000000000002E-004   4.0861000000000001E-004   4.0970999999999998E-004   4.1144000000000000E-004   4.1177999999999994E-004   4.1160999999999997E-004   4.1099999999999996E-004   4.1077999999999997E-004   4.1047000000000002E-004   4.1013999999999997E-004   4.1000999999999999E-004
         Month =          10
   4.1172000000000002E-004   4.1114999999999994E-004   4.1237999999999995E-004   4.1209999999999999E-004   4.1077999999999997E-004   4.1110000000000002E-004   4.1175999999999995E-004   4.1212999999999997E-004   4.1164999999999995E-004   4.1120999999999996E-004   4.1104999999999999E-004   4.1089999999999996E-004
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node e90980d4b77e exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

For verification, I've also tried running the regional_Laura test case, and get a similar error:

   Updating solar constant with cycle approx
    Opened solar constant data file: INPUT/solarconstant_noaa_an.txt 
  CHECK: Solar constant data used for year        2020   1361.0400000000000        1361.0400000000000     
0 FORECAST DATE          26 AUG.  2020 AT 12 HRS  0.00 MINS
  JULIAN DAY             2459088  PLUS   0.000000
  RADIUS VECTOR          1.0104738
  RIGHT ASCENSION OF SUN  10.3754267 HRS, OR  10 HRS  22 MINS  31.5 SECS
  DECLINATION OF THE SUN  10.1408708 DEGS, OR   10 DEGS   8 MINS  27.1 SECS
  EQUATION OF TIME        -1.7063098 MINS, OR   -102.38 SECS, OR-0.007466 RADIANS
  SOLAR CONSTANT        1332.9711572 (DISTANCE AJUSTED)


    for cosz calculations: nswr,deltim,deltsw,dtswh =           8   450.00000000000000        3600.0000000000000        1.0000000000000000        anginc,nstp =   3.2724923474893676E-002           9
    Opened aerosol data file: INPUT/aerosol.dat               
   --- Reading  MONTH OF AUGUST    CLIMATOLOGICAL AEROSOL GLOBAL DISTRIBUTION                  
    Request volcanic date out of range, optical depth set to lowest value
  CHECK: Sample Volcanic data used for month, year:           8        2020
           1           1           1           1
    Opened co2 data file: INPUT/co2historicaldata_2020.txt
        2020  MONTHLY CO2 (PPMV)   24  12  LON/LAT (N-S/0-360E) IN 15 DEGREE RESOLUTION,  GLB ANNUAL MEAN =   412.81000000000000        GROWTH RATE =   2.5200000000000000     
    Global annual mean CO2 data for year        2020   4.1281000000000000E-004
  CHECK: Sample of selected months of CO2 data used for year:        2020
         Month =           1
   4.1894999999999996E-004   4.1873000000000002E-004   4.1708999999999995E-004   4.1537999999999997E-004   4.1341000000000001E-004   4.1173000000000002E-004   4.1005000000000002E-004   4.0923000000000001E-004   4.0920999999999997E-004   4.0912999999999995E-004   4.0892000000000001E-004   4.0863000000000000E-004
         Month =           4
   4.2148000000000001E-004   4.1961000000000000E-004   4.1841000000000003E-004   4.1831999999999997E-004   4.1779000000000002E-004   4.1539999999999996E-004   4.1255999999999997E-004   4.1018000000000001E-004   4.1001999999999998E-004   4.0969999999999998E-004   4.0936999999999999E-004   4.0924000000000001E-004
         Month =           7
   4.0852999999999994E-004   4.0848000000000002E-004   4.0861000000000001E-004   4.0970999999999998E-004   4.1144000000000000E-004   4.1177999999999994E-004   4.1160999999999997E-004   4.1099999999999996E-004   4.1077999999999997E-004   4.1047000000000002E-004   4.1013999999999997E-004   4.1000999999999999E-004
         Month =          10
   4.1172000000000002E-004   4.1114999999999994E-004   4.1237999999999995E-004   4.1209999999999999E-004   4.1077999999999997E-004   4.1110000000000002E-004   4.1175999999999995E-004   4.1212999999999997E-004   4.1164999999999995E-004   4.1120999999999996E-004   4.1104999999999999E-004   4.1089999999999996E-004
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node d312f888f66b exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@StevePny What test are you running that results in this seg fault failure? Is it one of the tests in the CI directory?

@StevePny To expand upon my previous comment: Is it one of the tests in the SHiELD_build repository CI directory?

@laurenchilutti
The latter case is the regional_Laura case distributed with the SHiELD-in-a-box:
https://www.gfdl.noaa.gov/shield/shield-in-a-box/
https://zenodo.org/record/5090124/files/regional_Laura.zip

I was able to install SHiELD and run these example cases (regional_Laura and global_nest_Laura) prior to the 202204 release.

@StevePny
I have tested the latest SHiELD code with the regional Laura case. When I build SHiELD natively on a NOAA HPC system, the Laura case works fine. However, it does not work with the containerized SHiELD, which is very strange.

It looks like the NCEP library is causing the crash. The segmentation fault occurs at

call getgb(lugb,lugi,kdata,lskip,jpds,jgds,ndata,lskip,

However, I still don't understand why. Just before this line, a call to another NCEP library routine, getgbh(), works fine. Also, the same compiler flags and arguments worked previously.

@kaiyuan-cheng just checking in - has any progress been made on clearing up this issue, or should we continue with the pre-202204 version?

@StevePny It turns out that the default stack size of 8 MB is too small to hold the large one-dimensional variable lbms. The solution is to set an unlimited stack size.
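For anyone who wants a rough sanity check before launching, here is a minimal sketch that compares an array's size against a stack limit given in KB (the unit `ulimit -s` reports). The function name `array_fits_stack`, the element count, and the 4-byte element size are all assumptions for illustration; the actual size of lbms is not stated in this thread. The stack also holds other data, so "fits" is an optimistic lower bound, not a guarantee.

```shell
#!/bin/sh
# Hypothetical helper: does an array of n_elements * bytes_per_element
# fit within a stack limit expressed in KB (or the word "unlimited")?
array_fits_stack() {
    limit_kb=$1
    n_elements=$2
    bytes_per_element=$3

    # An unlimited stack can hold any array (memory permitting).
    if [ "$limit_kb" = "unlimited" ]; then
        echo "fits"
        return 0
    fi

    # Round the array size up to whole KB before comparing.
    needed_kb=$(( (n_elements * bytes_per_element + 1023) / 1024 ))
    if [ "$needed_kb" -le "$limit_kb" ]; then
        echo "fits"
    else
        echo "too large"
    fi
}

# Example with an assumed 4,000,000-element array of 4-byte values,
# checked against the current soft stack limit of this shell.
array_fits_stack "$(ulimit -s)" 4000000 4
```

With the 8 MB default (8192 KB), the assumed 4,000,000 x 4-byte array needs 15625 KB and is reported as too large.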

One clarifying detail:
The Docker container does not inherit the system stack limit by default. The ulimit can be set on the command line when running the Docker container, but 'unlimited' is not a permitted value. To specify an unlimited stack size in the Docker container, add this option:

--ulimit stack=-1

With this setting I can run the regional_Laura_test case on an AWS c6g.8xlarge EC2 instance.
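As a sketch of where the option goes, the snippet below prints a full `docker run` command line that applies the setting and then reports the stack limit seen inside the container. The image name `shield:latest` is a placeholder I made up, not an image from this thread, and the function name `docker_stack_cmd` is hypothetical. The command is printed instead of executed so it can be reviewed first.

```shell
#!/bin/sh
# IMAGE is a placeholder: "shield:latest" is a hypothetical image name.
IMAGE="${IMAGE:-shield:latest}"

# Print (do not run) a docker command that starts the container with
# --ulimit stack=-1 and asks the container's shell for its stack limit.
docker_stack_cmd() {
    echo "docker run --rm --ulimit stack=-1 $1 sh -c 'ulimit -s'"
}

docker_stack_cmd "$IMAGE"
```

If the limit was applied, running the printed command should report `unlimited` from inside the container; the same `--ulimit stack=-1` option would then be added to whatever `docker run` command launches the model.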

Note: to be safe, I also set the stack size on the EC2 instance itself with:
ulimit -s unlimited
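One caveat with the command above: an unprivileged user can only raise the soft limit up to the hard limit, so `ulimit -s unlimited` may be refused on some systems. A minimal sketch of a fallback, assuming a shell whose `ulimit` builtin supports `-H`, `-S` and `-s` (bash and dash do), is to raise the soft limit to whatever the hard limit allows and then print the result. The function name `raise_stack_limit` is hypothetical.

```shell
#!/bin/sh
# Raise the soft stack limit as far as the hard limit allows, then report it.
raise_stack_limit() {
    # Hard limit in KB, or the word "unlimited".
    hard=$(ulimit -H -s)

    # Setting the soft limit equal to the hard limit needs no extra privileges.
    ulimit -S -s "$hard" 2>/dev/null || echo "could not raise soft stack limit" >&2

    # Print the soft limit now in effect.
    ulimit -S -s
}

raise_stack_limit
```

The function has to be called directly in the shell that will launch the run (not inside `$(...)`), because limits set in a subshell do not propagate back to the parent shell.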