AFD-Illinois/iharm3d

Problem running in parallel on the MareNostrum cluster

cpalenzuela opened this issue · 4 comments

I am trying to run on the MareNostrum cluster using the following submit script:

##################################
#!/bin/bash
#SBATCH --job-name=iHARM
#SBATCH --qos=debug
#SBATCH --output=bns_%j.out
#SBATCH --error=bns_%j.err
#SBATCH --ntasks-per-node 1
#SBATCH -N 2
#SBATCH --time=01:00:00

#module load python/3.6.1
module load hdf5 gsl szip

# Set OpenMP threads to the number of physical cores reported by /proc/cpuinfo
NUM_CPUS=$(grep '^cpu cores' /proc/cpuinfo | uniq | awk '{print $4}')
export OMP_NUM_THREADS=$(( $NUM_CPUS ))

srun ./harm -p param.dat
#################################

With -N 1 it works well, but with -N 2 it gives the following error:


load hdf5/1.8.19 (PATH, LD_LIBRARY_PATH, LIBRARY_PATH, C_INCLUDE_PATH,
CPLUS_INCLUDE_PATH)
load gsl/2.4 (PATH, LD_LIBRARY_PATH, LIBRARY_PATH, MANPATH, INFOPATH,
C_INCLUDE_PATH, CPLUS_INCLUDE_PATH, PKG_CONFIG_PATH, GSL_DIR)
load szip/2.1.1 (C_INCLUDE_PATH, LD_LIBRARY_PATH)
Fatal error in PMPI_Comm_rank: Invalid communicator, error stack:
PMPI_Comm_rank(122): MPI_Comm_rank(MPI_COMM_NULL, rank=0xa508ac) failed
PMPI_Comm_rank(75).: Null communicator
slurmstepd: error: *** STEP 17790569.0 ON s24r1b70 CANCELLED AT 2021-10-07T08:11:36 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: s24r1b70: task 0: Killed
srun: Terminating job step 17790569.0
srun: error: s24r1b71: task 1: Killed

Do you know what the problem is?

Hi,
It looks like you're setting up an MPI communicator larger than the topology you've compiled into iharm3D. The extra process can't find a place in the communicator that iharm3D sets up, so it errors. This can be fixed by changing N1CPU, N2CPU, N3CPU compile-time parameters in build_archive/parameters.h, and rebuilding iharm3D. Just make sure that N1CPU * N2CPU * N3CPU equals the total number of MPI processes (nodes*procs-per-node).
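
For the job script above (2 nodes x 1 task per node = 2 MPI ranks), that could look something like the excerpt below. The N1CPU/N2CPU/N3CPU names are the parameters mentioned above, but the particular values and the choice to split only along X3 are just an illustration, not a prescription:

/* build_archive/parameters.h (excerpt, illustrative values only):
 * decompose the grid across 2 MPI ranks, splitting along X3 */
#define N1CPU 1
#define N2CPU 1
#define N3CPU 2   /* N1CPU * N2CPU * N3CPU must equal nodes * tasks-per-node = 2 */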

To clarify, iharm3D takes the size of the MPI communicator as a compile-time parameter -- specifically, the number of processes over which to decompose the domain in the X1, X2, and X3 directions. Once compiled, iharm3D can be run only with that number of MPI processes -- so 'mpirun -N 1' and 'mpirun -N 2' will never both work for the same 'harm' binary.

This is because the decomposition will affect the per-process domain block size, which we want to be determined at compile time. We then set up a communicator of our predetermined size, and throw this error if there are more (or fewer) processes than expected. This has given us better control over the mesh decomposition while strictly adhering to 1 process == 1 mesh block -- albeit at the cost of being terribly counterintuitive!
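
As a rough sketch of that logic (my own minimal example, not code from the iharm3D tree; the grid sizes, the macro names other than N1CPU/N2CPU/N3CPU, and the periodicity choices are assumptions): the compile-time decomposition fixes the per-rank block size and the size of the Cartesian communicator, and a rank that doesn't fit is handed MPI_COMM_NULL, which is consistent with the "Null communicator" failure in the log above.

#include <mpi.h>
#include <stdio.h>

/* Illustrative compile-time grid size and decomposition (assumed values) */
#define N1TOT 128
#define N2TOT 128
#define N3TOT 128
#define N1CPU 1
#define N2CPU 1
#define N3CPU 2

/* Per-rank block size follows directly from the decomposition */
#define N1 (N1TOT / N1CPU)
#define N2 (N2TOT / N2CPU)
#define N3 (N3TOT / N3CPU)

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  /* Build a Cartesian communicator of exactly N1CPU*N2CPU*N3CPU ranks.
   * If the job is launched with more ranks than that, the extras are
   * given MPI_COMM_NULL, and the next MPI_Comm_rank call on it is a
   * fatal error -- the same failure shown in the log above. */
  int dims[3] = {N1CPU, N2CPU, N3CPU};
  int periods[3] = {0, 0, 1};
  MPI_Comm cart;
  MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

  int rank;
  MPI_Comm_rank(cart, &rank);
  printf("rank %d owns a %d x %d x %d block\n", rank, N1, N2, N3);

  MPI_Finalize();
  return 0;
}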

Now it is working, thanks!

@bprather what do you think about including a quick (programmatic) check when the code first starts that compares the number of MPI processes with the expectation from N#CPU? That way we could print out a more descriptive error message …
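
Something along these lines, for instance (a sketch only, not the eventual patch; the function name and message wording are hypothetical, and N1CPU/N2CPU/N3CPU are the compile-time parameters from parameters.h):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Placeholder values so this sketch compiles stand-alone; in iharm3D these
 * would come from parameters.h. */
#ifndef N1CPU
#define N1CPU 1
#define N2CPU 1
#define N3CPU 2
#endif

/* Hypothetical startup check: abort with a readable message if the MPI
 * world size does not match the compiled-in decomposition, instead of
 * crashing later on a null communicator. */
void check_mpi_world_size(void)
{
  int world_size, world_rank;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  if (world_size != N1CPU * N2CPU * N3CPU) {
    if (world_rank == 0) {
      fprintf(stderr,
              "ERROR: harm was compiled for %d MPI processes "
              "(N1CPU=%d x N2CPU=%d x N3CPU=%d) but was started with %d.\n"
              "Change N1CPU/N2CPU/N3CPU in parameters.h and rebuild, "
              "or adjust the job size.\n",
              N1CPU * N2CPU * N3CPU, N1CPU, N2CPU, N3CPU, world_size);
    }
    MPI_Finalize();
    exit(1);
  }
}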

Implemented @gnwong's suggestion in #41, pending testing. Closing here now that things are working.