geoschem/GCHP

GCHP v14.3.0 hanging with reading Restart file with gnu compiler and openmpi

yuanjianz opened this issue · 7 comments

Name and Institution (Required)

Name: Yuanjian Zhang
Institution: Washington University in St. Louis

Confirm you have reviewed the following documentation

Description of your issue or question

I am running the GCHP v14.3 dev/no-diff-to-benchmark branch (which fixes some problems for GEOS-IT). My configuration is standard full chemistry at C180 with native cubed-sphere GEOS-IT. I am using the GNU compiler and OpenMPI (the same environment as @yidant's official GCHP docker).

The issue is that multi-node jobs like mine, with 600 cores and 300 GB of memory, seem to get stuck while reading the restart file:

MAPL_StateCreateFromSpecNew: var PS2 already exists. Skipping ...
                                                       Mem/Swap Used (MB) at GCHPctmEnvMAPL_GenericInitialize=  9.224E+04  0.000E+00
     GCHPctmEnv: INFO: Configured to expect 'top-down' meteorological data from 'ExtData'
     GCHPctmEnv: INFO: Configured to use dry air pressure in advection
     GCHPctmEnv: INFO: Configured to correct native mass flux (if using) for humidity
 Real*4 Resource Parameter: GCHPchem_DT:1200.000000
 Integer*4 Resource Parameter: GCHPchem_REFERENCE_TIME:1000
 Character Resource Parameter: GCHPchem_INTERNAL_RESTART_FILE:gchp_restart.nc4
 Character Resource Parameter: MAPL_ENABLE_BOOTSTRAP:YES
 Using parallel NetCDF for file: gchp_restart.nc4
   Bootstrapping Variable: ARCHV_DRY_TOTN in gchp_restart.nc4
   Bootstrapping Variable: ARCHV_WET_TOTN in gchp_restart.nc4
   Bootstrapping Variable: AREA in gchp_restart.nc4
   Bootstrapping Variable: AeroH2O_SNA in gchp_restart.nc4
   Bootstrapping Variable: DEP_RESERVOIR in gchp_restart.nc4
   Bootstrapping Variable: DRYPERIOD in gchp_restart.nc4
   Bootstrapping Variable: GCCTROPP in gchp_restart.nc4
   Bootstrapping Variable: GWET_PREV in gchp_restart.nc4
   Bootstrapping Variable: LAI_PREVDAY in gchp_restart.nc4
   Bootstrapping Variable: ORVCSESQ in gchp_restart.nc4
   Bootstrapping Variable: PARDF_DAVG in gchp_restart.nc4
   Bootstrapping Variable: PARDR_DAVG in gchp_restart.nc4
   Bootstrapping Variable: PFACTOR in gchp_restart.nc4
   Bootstrapping Variable: STATE_PSC in gchp_restart.nc4
   Bootstrapping Variable: T_DAVG in gchp_restart.nc4
   Bootstrapping Variable: T_PREVDAY in gchp_restart.nc4

I have tried to identify the issue:

  • A small-scale C30 run with 60 cores works.
  • Substituting the Intel compiler and Intel MPI gets past this step, but the run later crashes with a segmentation fault (my guess is the outdated ESMF v8.3.1).
  • Enabling parallel NetCDF with 6 readers does not help (see the sketch below).
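For reference, this is roughly how I toggled the reader count for that test. It is only a sketch: I am assuming the NUM_READERS / NUM_WRITERS resource names in GCHP.rc, so please verify against your own run directory.

    # Sketch (assumed keys): check and bump the parallel NetCDF reader count in GCHP.rc
    grep -E "NUM_READERS|NUM_WRITERS" GCHP.rc
    sed -i "s/NUM_READERS: .*/NUM_READERS: 6/" GCHP.rc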

@yidant encountered a similar issue, but we are not sure how to fix it.

Here are my configuration and log files:
gchp.20190701_0000z.log
runjob.log
setCommonRunSettings.log

Substituting the Intel compiler and Intel MPI gets past this step, but the run later crashes with a segmentation fault (my guess is the outdated ESMF v8.3.1).

I think we should keep pursuing this configuration and get it to work. Why do you think the later seg fault is due to outdated ESMF? Is there an ESMF error message?

If you can get it working with this configuration of libraries then at least you will have it running. We can then go back and figure out the issues with GNU/OpenMPI, potentially creating a GitHub issue with the MAPL developers.

@lizziel Sure. Shall I continue here or open another issue? I am not sure it is about ESMF. Maybe we can find some hints in the log files.
gchp.20190701_0000z.log
runjob.log

It seems like a Cloud-J problem. I tried to identify the line the traceback points to, and it is just an ordinary loop as far as I can tell.

==== backtrace (tid:     67) ====
 0 0x00000000016a2ba0 fjx_sub_mod_mp_blkslv_()  /Projects/GCHP/14.3/src/GCHP_GridComp/Cloud-J/src/Core/fjx_sub_mod.f90:1355
1355       do K = 1,W_+W_r
1356       if (LDOKR(K) .gt. 0) then
1357        call GEN_ID (POMEGA(1,1,K),FZ(1,K),ZTAU(1,K),FSBOT(K),RFL(1,K), &
1358              PM,PM0, B(1,1,1,K),CC(1,1,1,K),AA(1,1,1,K), &
1359                      A(1,1,K),H(1,1,K),C(1,1,K), ND)
1360       endif
1361       enddo

Intel compiler: 19.1.0.166 20191121
Intel MPI: Version 2019 Update 6 Build 20191024
ESMF: 8.3.1
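In case it helps narrow this down, this is roughly how I would rebuild with debug symbols to get a more precise traceback. It is only a sketch assuming the standard GCHP CMake workflow; the paths are placeholders.

    # Sketch: rebuild GCHP with debug symbols (paths are placeholders)
    cd /path/to/rundir/build            # build directory inside the run directory
    cmake ../CodeDir -DRUNDIR=.. -DCMAKE_BUILD_TYPE=Debug
    make -j install                     # installs the debug gchp executable into the run directory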

Yes, please make a new issue for this. Thanks!

Hi @yuanjianz and @yidant, has the issue with GNU and OpenMPI been resolved?

Hi @lizziel, I tested it just now. It is still hanging at the

Bootstrapping Variable: T_PREVDAY in gchp_restart.nc4

when using more than 1 node (the exact test scenario is 48 x 2 = 96 cores, C30, GEOS-IT, native mass flux).

I successfully ran the GEOS-IT and MERRA-2 C24 benchmarks with 72 cores on a single node with GNU yesterday, so I assume it could be an MPI issue.
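For reference, these are the standard OpenMPI commands I use to confirm which MPI the build and run environment actually pick up:

    # Report the OpenMPI version used at run time
    mpirun --version
    # Show how this OpenMPI install was built/configured (first lines are enough)
    ompi_info | head -n 20
    # Confirm which MPI compiler wrappers the GCHP build found
    which mpifort mpicc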

Update (2024-03-16):

I am closing this issue because I found it is related to an old version of OpenMPI. The official GCHP docker image geoschem/gchp:14.3.0 currently uses OpenMPI 3.0.5, which does not meet the OpenMPI version >= 4 recommended in the official documentation.

I manually updated it to 4.1.1 and the performance of MPI jobs improved significantly. I will be working with @yidant to update the official docker image.
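For anyone hitting the same hang, the upgrade itself is a standard OpenMPI source build, roughly along these lines (a sketch; the configure flags and install prefix will depend on your container or cluster):

    # Sketch: build and install OpenMPI 4.1.1 from source (prefix is a placeholder)
    wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.1.tar.gz
    tar -xzf openmpi-4.1.1.tar.gz && cd openmpi-4.1.1
    ./configure --prefix=/opt/openmpi-4.1.1
    make -j && make install
    # Rebuild ESMF and GCHP against the new MPI before rerunning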

Ah, excellent. Yes, I think we had to update to OpenMPI 4.0 quite a while ago.