geoschem/GCHP

Errors running GCHP using Singularity

Closed this issue · 15 comments

Name and Institution (Required)

Name: Tessa Clarizio
Institution: UIUC

Confirm you have reviewed the following documentation

Description of your issue or question

I have been trying to run GCHP v14.2.0 using Singularity. Our campus cluster uses Slurm to submit jobs. I was able to install GCHP successfully (see issue #347), but I am now running into issues actually running the model: I get an 'unknown error' while the model is trying to read all the .90 files. The last line of my error output is "95 more processes have sent help message help-mpi-api.txt / mpi-abort". @yidant has been helping with this issue but has come across similar challenges. We think it may be related to internal vs. external MPI compatibility, but we are not sure. It was recommended that we create a GitHub issue. What are the next steps to take from here?

GCHP error
error_message_20240213.out.txt
runscript_slurm.sbatch.txt
setCommonRunSettings.txt

This is the error @yidant came across:

"The work we are doing is to use Singularity to run GCHP on their cluster with the Slurm scheduler. We were using GCHP 14.2.0. The configuration I set is TOTAL_CORES=24, NUM_NODES=1, NUM_CORES_PER_NODE=24.
I received the error message below. Since it appeared to be related to restart files, I copied the restart files in place of the symbolic links, but the error persists.
Here's my run script. I haven't used Slurm before, so I don't know if there is something wrong with my script.
I have also tried using interactive mode to re-compile GCHP 14.3.0 and test in the interactive job. When I ran mpirun -n 24 ./gchp, I encountered PE errors that did not occur when I used the same command on Compute1."

Error messages:

Starting PEs : 24
Starting Threads : 8

FATAL from PE 0: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 2: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 4: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 6: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 8: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 10: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 12: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 14: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 16: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 18: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 20: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 22: mpp_domains_define.inc: not all the pe_end are in the pelist

FATAL from PE 0: mpp_domains_define.inc: not all the pe_end are in the pelist


MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[ccc0268.campuscluster.illinois.edu:03589] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 1741
[ccc0268.campuscluster.illinois.edu:03589] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 1741
gchp.20190701_0000z.log
yidan_error.txt
yidan_runscript.txt
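One common cause of the "not all the pe_end are in the pelist" error (a general observation, not confirmed for this specific case) is a mismatch between the core counts configured in setCommonRunSettings.sh and what Slurm/mpirun actually provide. A minimal sanity-check sketch, using the values quoted above:

```shell
# Values from the thread; GCHP requires TOTAL_CORES to be a multiple of 6
# (one group of cores per cubed-sphere face) and consistent with the
# Slurm allocation actually handed to mpirun.
TOTAL_CORES=24
NUM_NODES=1
NUM_CORES_PER_NODE=24

if [ $((TOTAL_CORES % 6)) -eq 0 ]; then
    echo "cores divisible by 6: ok"
fi
if [ "$TOTAL_CORES" -eq $((NUM_NODES * NUM_CORES_PER_NODE)) ]; then
    echo "node layout consistent: ok"
fi
```

If these numbers pass but the error persists, the MPI processes launched may not match what GCHP expects, e.g. if the scheduler grants fewer cores than requested.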

Regarding the error in the second message, it looks like your restart file is missing a required species. From the error log:

pe=00022 FAIL at line=03056 NCIO.F90 <Could not find field SPC_BUTDI in gchp_restart.nc4 >
The standard simulation requires that all species be present in the restart file. You can turn this requirement off within setCommonRunSettings.sh. See these lines in the config file; set the value to zero to not require all species.
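For illustration, the relevant setting looks like the fragment below (the variable name is as in recent GCHP versions; check your copy of setCommonRunSettings.sh, since it may differ):

```shell
# In setCommonRunSettings.sh:
# 0 = do not require every species in the restart file;
#     missing species are given default background values
Require_Species_in_Restart=0
```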

That being said, where are you getting your restart file? The restart files that come with the run directory should contain all species. Are you using an older run directory with newer code, or vice versa?

Regarding the issue in the first comment, the primary error message is the line that says "factories not equal". The errors appear as a traceback, meaning error messages from the deepest point in the code all the way up to the highest-level calling routine. The first error message is therefore the place to start, in this case line 6123 of file MAPL_Generic.F90, with message "factories not equal".

This error can happen when running a stretched grid. See GCHP issue #318 for discussion of this error. I see you are also running a stretched grid. How did you generate your restart file?

Hi @lizziel, thank you for your help. To give more context:
The restart links were broken when we set up the run directory, so I downloaded restart files from the WUSTL site http://geoschemdata.wustl.edu/ExtData/GEOSCHEM_RESTARTS/GC_14.2.0/ and then created a symbolic link to the relevant restart file name in the directory. I think when I got the error message I had mistakenly used a restart file from 14.0.0 instead of 14.2.0, but I have fixed that now. I also tried running again with stretched grid turned off, but I got a similar set of errors; the message is now "Error calling Linoz_Read".

(I originally had the stretched grid on because I was loosely following this tutorial, but I was using a default restart file rather than the one in the tutorial: https://gchp.readthedocs.io/en/latest/supplement/stretched-grid.html)
error_message_20240221.out.txt

Hi @tessac2, do you have the GEOS-Chem log as well? I wonder if there are any error messages there from within linoz_read.

gchp.20190701_0000z.log

Thanks @lizziel. Yes, I have attached it here. Looking at the GEOS-Chem log, it seems the model cannot find the files; it looks like there is an issue with ChemDir? Should this have been automatically populated when I set up the run directory?


Yes, those links should have been automatically set during run directory creation. What are they pointing to instead? You can find out with the file command, e.g. file HcoDir.
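To illustrate what a broken link looks like when inspected this way (the target path below is made up for the example):

```shell
# Sketch with a throwaway directory; the target path is hypothetical.
tmp=$(mktemp -d)
ln -s /nonexistent/ExtData/HEMCO "$tmp/HcoDir"
readlink "$tmp/HcoDir"   # prints the target path even if it does not exist
file "$tmp/HcoDir"       # reports "broken symbolic link to /nonexistent/..."
rm -r "$tmp"
```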

These links are set using an ExtData path stored in a config file in directory .geoschem within your home directory. You are prompted to set that path the very first time you create a run directory on a new compute cluster. Check if you have file .geoschem/config in your home directory. Directory .geoschem is hidden since its name starts with a dot, but you can see it by running ls -a in your home directory. See here for the GCHP docs on this.

I do have .geoschem/config but no other files there. When I open the config file it contains a path to my home directory, but that is not where the ExtData is stored. To set up the run directory I had been using the command below. I don't get the 'Enter path for ExtData' prompt shown in the link you provided, I think because I specify ExtData in the command below. But it is odd that the config file has a different path than what I specify.

singularity exec -B $HOME:$HOME -B /projects/horowitz_group/GEOSChem_input_data/ExtData:/ExtData -B /projects/horowitz_group/tessa/GCHP/rundirs:/workdir gchp.sif /bin/bash -c ". ~/.bashrc && /opt/geos-chem/bin/createRunDir.sh"


Try deleting the .geoschem directory and then create a new run directory. This will be a good test of whether there is an issue setting the path. When creating the run directory you only get prompted to set the ExtData path if directory .geoschem is not found.

Another option is to manually set the path in GC_DATA_ROOT in .geoschem/config and then create another run directory. However, if you could use the other method of deleting .geoschem, that would help us determine whether there is an issue with it.
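A sketch of the manual edit, done here on a temporary copy so it is safe to run (GC_DATA_ROOT is the variable name GCHP stores in .geoschem/config; check your file, and the /ExtData target is the container-side bind path from this thread):

```shell
# Rewrite GC_DATA_ROOT in a copy of the config file.
cfg=$(mktemp)
echo 'export GC_DATA_ROOT=/some/old/path' > "$cfg"   # placeholder old entry
sed -i 's|^export GC_DATA_ROOT=.*|export GC_DATA_ROOT=/ExtData|' "$cfg"
cat "$cfg"
rm "$cfg"
```

To edit the real file, run the sed command against ~/.geoschem/config instead of the temporary copy.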

Thanks @lizziel ! I deleted the .geoschem folder. However, now when I set up the run directory it says the path to the ExtData folder does not exist. Is there a different way I need to set it up since I am using a container (Singularity)?

Hi @tessac2! Yes, you've bound /projects/horowitz_group/GEOSChem_input_data/ExtData as /ExtData in the container, so you'll need to use /ExtData instead.

Thanks @yidant. I have done this, but ChemDir, CodeDir, HcoDir, and MetDir are still not populated.


Hi @tessac2, if it is set up correctly, these files will link to the right paths inside the container. For example, ChemDir will link to /ExtData/CHEM_INPUTS. Then when you launch the container to run GCHP, it will find those paths inside it.

This works at WashU, but if it doesn't work for you, I would suggest binding the whole data folder at the same path inside and outside the container, i.e. /projects/horowitz_group/GEOSChem_input_data/ExtData:/projects/horowitz_group/GEOSChem_input_data/ExtData. Then you can set the ExtData path to /projects/horowitz_group/GEOSChem_input_data/ExtData, and the directories will also be available outside the container.
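A small sketch of why an identity bind helps: symbolic links created inside the container record container-side absolute paths, so if the same absolute path also exists on the host, the link resolves in both contexts. The throwaway directory below stands in for the bound data path:

```shell
# Simulate a link whose target path exists at the same absolute location
# in both contexts (here: a temp dir standing in for the bound ExtData).
tmp=$(mktemp -d)
mkdir -p "$tmp/ExtData/CHEM_INPUTS"
ln -s "$tmp/ExtData/CHEM_INPUTS" "$tmp/ChemDir"
readlink -e "$tmp/ChemDir" >/dev/null && echo "link resolves"
rm -r "$tmp"
```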

Thanks @yidant! Binding the whole data folder has worked for me! The CodeDir link still appears broken, but I wonder if that is OK; GCHP seems to be running fine without it at the moment?


Hi @tessac2, it is correct for CodeDir to link to the source code inside the container. If GCHP runs well, it is all good!