LLNL/UnifyFS

MPI_File_open fails on two nodes on Frontier

adammoody opened this issue · 9 comments

When I run a two-process, two-node test on Frontier, MPI_File_open seems to return an error. I ran into this when PnetCDF tests were failing, and I simplified things down to this reproducer.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
  int rc;

  char filename[] = "/unifyfs/foo";

  MPI_Init(&argc, &argv);

  int rank, ranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ranks);
  printf("%d of %d\n", rank, ranks);
  fflush(stdout);

  MPI_File fh;
  int amode = MPI_MODE_CREATE | MPI_MODE_RDWR;
  rc = MPI_File_open(MPI_COMM_WORLD, filename, amode, MPI_INFO_NULL, &fh);
  printf("%d\n", rc);
  fflush(stdout);

  rc = MPI_File_close(&fh);
  printf("%d\n", rc);
  fflush(stdout);

  MPI_Finalize();
  return 0;
}

Built with:

#!/bin/bash
module use /sw/frontier/unifyfs/modulefiles
module load unifyfs/1.1/gcc-12.2.0
module load gcc/12.2.0
module load PrgEnv-gnu
module unload darshan-runtime

mpicc -o mpiopen mpiopen.c

Here is the script used to configure and launch. These settings probably don't matter, but I'll capture them just in case.

#!/bin/bash
# salloc -N 2 -p batch

installdir=/sw/frontier/unifyfs/spack/env/unifyfs-1.1/gcc-12.2.0/view

# disable data sieving
#>>: cat romio_hints.txt 
#romio_ds_read disable
#romio_ds_write disable
export ROMIO_HINTS=`pwd`/romio_hints.txt

# https://www.nersc.gov/assets/Uploads/MPI-Tips-rgb.pdf
#export MPICH_MPIIO_HINTS_DISPLAY=1
export MPICH_MPIIO_HINTS="romio_ds_read=disable,romio_ds_write=disable"

# http://cucis.eecs.northwestern.edu/projects/PnetCDF/doc/pnetcdf-c/PNETCDF_005fHINTS.html
export PNETCDF_HINTS="romio_ds_read=disable;romio_ds_write=disable"

export UNIFYFS_MARGO_CLIENT_TIMEOUT=70000

export UNIFYFS_CONFIGFILE=/var/tmp/unifyfs.conf
touch $UNIFYFS_CONFIGFILE

export UNIFYFS_CLIENT_LOCAL_EXTENTS=0
export UNIFYFS_CLIENT_WRITE_SYNC=0
export UNIFYFS_CLIENT_SUPER_MAGIC=0

# sleep for some time after unlink
# see https://github.com/LLNL/UnifyFS/issues/744
export UNIFYFS_CLIENT_UNLINK_USECS=1000000

srun --overlap -n 2 -N 2 mkdir -p /dev/shm/unifyfs
export UNIFYFS_LOGIO_SPILL_DIR=/dev/shm/unifyfs

# test_ncmpi_put_var1_schar executes many small writes, so it was
# necessary to reduce the chunk size to avoid exhausting space
export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 4096)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 1024 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 \* 1048576)

export UNIFYFS_LOG_DIR=`pwd`/logs
export UNIFYFS_LOG_VERBOSITY=1

export LD_LIBRARY_PATH="${installdir}/lib:${installdir}/lib64:$LD_LIBRARY_PATH"

# turn off darshan profiling
export DARSHAN_DISABLE=1

export LD_PRELOAD="${installdir}/lib/libunifyfs_mpi_gotcha.so"
srun --label --overlap -n 2 -N 2 ./mpiopen

Running that, I get the following output:

+ srun --label --overlap -n 2 -N 2 ./mpiopen
1: 1 of 2
0: 0 of 2
1: 1006679845
1: 201911579
0: 469947936
0: 201911579

I should have run those return codes through MPI_Error_string. Regardless, you can see that the first integer printed by rank 0 differs from the one printed by rank 1, so at least one of them got something other than MPI_SUCCESS, maybe both. In my PnetCDF test, rank 1 usually reports ENOENT, while rank 0 detects that rank 1 failed and reports a more generic error.
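
For reference, here's a minimal sketch (not what I ran above) of how those return codes could be decoded with MPI_Error_class and MPI_Error_string; the path and open mode just mirror the reproducer:

#include <stdio.h>
#include "mpi.h"

/* Decode an MPI return code into its error class and message string. */
static void print_mpi_error(int rank, const char* label, int rc)
{
  char msg[MPI_MAX_ERROR_STRING];
  int len = 0;
  int errclass = 0;
  MPI_Error_class(rc, &errclass);
  MPI_Error_string(rc, msg, &len);
  printf("%d: %s rc=%d class=%d (%s)\n", rank, label, rc, errclass, msg);
  fflush(stdout);
}

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Same open as the reproducer; file handles default to
     MPI_ERRORS_RETURN, so a failure comes back as a return code. */
  MPI_File fh;
  int amode = MPI_MODE_CREATE | MPI_MODE_RDWR;
  int rc = MPI_File_open(MPI_COMM_WORLD, "/unifyfs/foo", amode,
                         MPI_INFO_NULL, &fh);
  print_mpi_error(rank, "MPI_File_open", rc);

  if (rc == MPI_SUCCESS) {
    rc = MPI_File_close(&fh);
    print_mpi_error(rank, "MPI_File_close", rc);
  }

  MPI_Finalize();
  return 0;
}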

@MichaelBrim, are you able to reproduce this?

@adammoody A couple of questions:

  1. I don't see a link against libunifyfs_mpi_gotcha in your build command. Is that just an oversight when submitting the issue?
  2. Did you allocate NVM resources to your job (i.e., using '-C nvme') and then use srun? Without the NVM option, the module-provided setting for UNIFYFS_LOGIO_SPILL_DIR (/mnt/bb/$USER) won't exist.

I think you answered my questions at the same time I posted them. Next question: I don't see anything in that script
that launches the servers.

Thanks, @MichaelBrim.

I'm launching the servers manually using the following script:

#!/bin/bash

module use /sw/frontier/unifyfs/modulefiles
#module load unifyfs/1.1/gcc-12.2.0
#module show unifyfs/1.1/gcc-12.2.0

module load gcc/12.2.0
module load PrgEnv-gnu

module unload darshan-runtime

set -x

installdir=/sw/frontier/unifyfs/spack/env/unifyfs-1.1/gcc-12.2.0/view

export LD_LIBRARY_PATH=${installdir}/lib:${installdir}/lib64:$LD_LIBRARY_PATH

procs=$SLURM_NNODES

srun -n $procs -N $procs touch /var/tmp/unifyfs.conf
export UNIFYFS_CONFIGFILE=/var/tmp/unifyfs.conf

export UNIFYFS_MARGO_CLIENT_TIMEOUT=700000
export UNIFYFS_MARGO_SERVER_TIMEOUT=800000

export UNIFYFS_SERVER_LOCAL_EXTENTS=0

export UNIFYFS_SHAREDFS_DIR=/lustre/orion/csc300/scratch/$USER

export UNIFYFS_DAEMONIZE=off

export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 65536)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 64 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 \* 1048576)

srun -n $procs -N $procs mkdir -p /dev/shm/unifyfs
export UNIFYFS_LOGIO_SPILL_DIR=/dev/shm/unifyfs

export UNIFYFS_LOG_DIR=`pwd`/logs
export UNIFYFS_LOG_VERBOSITY=5

export ABT_THREAD_STACKSIZE=256000

srun -n $procs -N $procs ${installdir}/bin/unifyfsd &

I execute that script to launch the servers, let things settle for about 10 seconds, and then run the earlier script to launch the application. Note that I'm not loading the unifyfs module here; instead I point LD_LIBRARY_PATH directly at the install directory.

All of these tests are also trying to use shared memory only: I've pointed the spill directory at /dev/shm but set the spill size to 0. I don't know whether this matters, but I'm pointing it out just in case.

Any reason you're not using unifyfs start to launch the servers? Its srun includes the options '--exact --overlap --ntasks-per-node=1', which may be necessary to run successfully. Also, what's your working directory when running? I ask because you're putting server logs in $PWD/logs, which would be bad if you're still in the /ccs/proj/csc300 area, since it's read-only on compute nodes.

I was launching manually because I've been doing a lot of debugging with TotalView. Sometimes I need to debug the servers and sometimes the client application. I haven't figured out the best way to do that with unifyfs start, so I tend to keep reusing these manual launch scripts.

I'm running out of my home directory, /ccs/home/$USER. The log files do show up there, and I've been using them to debug as well.

I'm unable to reproduce this issue in my environment using my normal UnifyFS job setup, which uses NVM rather than shmem. Here's the successful app output I get.

> more mpiio-issue788-gotcha.out.*
::::::::::::::
mpiio-issue788-gotcha.out.frontier03321.2.0
::::::::::::::
0 of 2
0
0
::::::::::::::
mpiio-issue788-gotcha.out.frontier03322.2.1
::::::::::::::
1 of 2
0
0

OK, good to know. It must be something in my environment. Thanks for testing, @MichaelBrim.

@adammoody, would you consider this resolved?

Yes, let's close this one as resolved.