darshan-hpc/darshan

Deadlock with simple hello-world

pramodk opened this issue · 6 comments

Dear Darshan Team,

I am seeing some confusing behavior and would like to check whether I am missing something obvious here. I have seen #559, but I am not sure it's the same issue (TBH, I might be wrong, as I didn't have time to look into the details):

Here is a quick summary:

  • Let's say we have a simple hello-world that is not doing anything useful:
#include <mpi.h>

int main(int argc, char**argv) {
  MPI_Init(&argc, &argv);

#ifdef ENABLE_MPI
  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "test.foo", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  MPI_File_close(&fh);
#endif

  MPI_Finalize();
}
  • I set up Darshan in a typical way and have everything working with Intel MPI:
module load unstable gcc intel-oneapi-mpi darshan-runtime

mpicxx -g hello_world.cpp -o hello

export DARSHAN_LOG_DIR_PATH=$PWD

DARSHAN_DIR=$(dirname `which darshan-config`)/../
export LD_PRELOAD=$DARSHAN_DIR/lib/libdarshan.so

mpirun ./hello         # note: using the mpirun launcher instead of srun here just for convenience

produces:

$ ls -l kumbhar_hello_id254834-254834_4-13-5604-15684913809476855937_1.darshan
-r--------+ 1 kumbhar bbp     1635 Apr 13 01:33 kumbhar_hello_id254834-254834_4-13-5604-15684913809476855937_1.darshan

With ROMIO_PRINT_HINTS=1 we can see that Intel MPI uses NFS as the default ADIO driver:

key = romio_filesystem_type     value = NFS:

So, if I force the GPFS driver then the program gets stuck:

export I_MPI_EXTRA_FILESYSTEM_FORCE=gpfs

As another example, let's look at the HPE MPI (MPT) library:

module load unstable gcc hpe-mpi darshan-runtime

mpicxx -g hello_world.cpp -o hello  

export DARSHAN_LOG_DIR_PATH=$PWD

DARSHAN_DIR=$(dirname `which darshan-config`)/../
export LD_PRELOAD=$DARSHAN_DIR/lib/libdarshan.so

srun ./hello

This also gets stuck! I see that a .darshan_partial file is generated, though:

-rw-r-----+ 1 kumbhar bbp     1673 Apr 13 01:36 kumbhar_hello_id255412-255412_4-13-5806-8325685343036153230.darshan_partial

This confused me because I have MPI I/O applications that work fine. For example, in the above test, let's enable the part of the code that opens a file using MPI I/O:

$ mpicxx -g hello_world.cpp -o hello  -DENABLE_MPI=1

and then srun ./hello finishes! 🤔 (at least for the few times I tried)

Launching DDT on the executable built without -DENABLE_MPI=1, the stack trace for 2 ranks looks like this:

[DDT screenshot: stack traces of the 2 hanging ranks]

which looks a bit confusing. (By the way, I quickly verified that MPI_File_write_at_all works with 0 as the count.)
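
A minimal sketch of such a zero-count check (the file name is arbitrary and this is just a standalone test, not Darshan code) could look like:

#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // open a throwaway shared file collectively
  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "count0.bin", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

  // collective write in which every rank passes count = 0
  MPI_Status status;
  MPI_File_write_at_all(fh, 0, NULL, 0, MPI_BYTE, &status);

  MPI_File_close(&fh);
  MPI_Finalize();
}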

I haven't spent much time digging into the ROMIO or Darshan code. I thought I should first ask here whether this looks obvious to the developer team, or whether you have seen it before.

Thank you in advance!

carns commented

Hi @pramodk , can you try running one of your deadlocking examples with this environment variable set?

export DARSHAN_LOGHINTS=""

It's been a little while since we've encountered this, but it's possible that the ROMIO driver for the file system has a bug that's only triggered when using the hints that Darshan sets when writing the log file.

carns commented

For a little more background, Darshan sets "romio_no_indep_rw=true;cb_nodes=4" by default. Taken together, these hints mean that regardless of how many ranks the application has, only 4 of them will actually open the Darshan log and act as aggregators. This helps at scale by keeping the cost of opening the log file from getting too high.
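
For illustration, those defaults correspond roughly to passing an MPI_Info like the following when the log file is opened (just a sketch of the mechanism with an arbitrary file name, not Darshan's actual code):

#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  // the hints Darshan sets by default for its log file
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_no_indep_rw", "true");  // defer all I/O to collective routines
  MPI_Info_set(info, "cb_nodes", "4");              // at most 4 aggregator ranks touch the file

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "example.darshan", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
  MPI_File_close(&fh);

  MPI_Info_free(&info);
  MPI_Finalize();
}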

Out of curiosity, what does DDT say about the location of the first hang you mention (Intel MPI with the gpfs ADIO driver forced)? Maybe it's failing in the collective create of the log file, since we don't see any evidence of an output log being created?

The 2nd example (MPT) you mention is clearly hanging the very first time Darshan tries to do collective writes to the log file -- log file creation clearly succeeds, as you get the .darshan_partial log. Phil's suggestion has sometimes helped with this sort of thing, so it is worth trying.

(just a quick partial response, will answer other questions tomorrow)

pramodk commented

@carns:

can you try running one of your deadlocking examples with this environment variable set?
export DARSHAN_LOGHINTS=""

Yes! I can confirm that changing romio_no_indep_rw via DARSHAN_LOGHINTS makes the program run successfully, i.e.:

the below fails (a standalone reproducer sketch follows the two hint dumps):

export ROMIO_PRINT_HINTS=1
DARSHAN_LOGHINTS="romio_no_indep_rw=true" srun ./hello
...
+ DARSHAN_LOGHINTS=romio_no_indep_rw=true
+ srun ./hello
key = romio_no_indep_rw         value = true
key = cb_buffer_size            value = 16777216
key = romio_cb_read             value = enable
key = romio_cb_write            value = enable
key = cb_nodes                  value = 2
key = romio_cb_pfr              value = disable
key = romio_cb_fr_types         value = aar
key = romio_cb_fr_alignment     value = 1
key = romio_cb_ds_threshold     value = 0
key = romio_cb_alltoall         value = automatic
key = ind_rd_buffer_size        value = 4194304
key = ind_wr_buffer_size        value = 524288
key = romio_ds_read             value = automatic
key = romio_ds_write            value = automatic
key = cb_config_list            value = *:1
key = romio_filesystem_type     value = GPFS: IBM GPFS
key = romio_aggregator_list     value = 0 2
...
...other errors / deadlock...
...

but the below succeeds:

export ROMIO_PRINT_HINTS=1
DARSHAN_LOGHINTS="romio_no_indep_rw=false" srun ./hello

srun ./hello
key = romio_no_indep_rw         value = false
key = cb_buffer_size            value = 16777216
key = romio_cb_read             value = automatic
key = romio_cb_write            value = automatic
key = cb_nodes                  value = 2
key = romio_cb_pfr              value = disable
key = romio_cb_fr_types         value = aar
key = romio_cb_fr_alignment     value = 1
key = romio_cb_ds_threshold     value = 0
key = romio_cb_alltoall         value = automatic
key = ind_rd_buffer_size        value = 4194304
key = ind_wr_buffer_size        value = 524288
key = romio_ds_read             value = automatic
key = romio_ds_write            value = automatic
key = cb_config_list            value = *:1
key = romio_filesystem_type     value = GPFS: IBM GPFS
key = romio_aggregator_list     value = 0 2
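
For completeness, here is a rough standalone reproducer sketch (independent of Darshan; the file name, buffer size, and hint values are my assumptions) that mimics the failing pattern: a collective open with romio_no_indep_rw=true followed by an MPI_File_write_at_all in which only rank 0 has a non-zero count:

#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // same hints Darshan sets by default
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_no_indep_rw", "true");
  MPI_Info_set(info, "cb_nodes", "4");

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "repro.out", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

  // all ranks join the collective write, but only rank 0 writes any data
  char buf[1024] = {0};
  int count = (rank == 0) ? (int)sizeof(buf) : 0;
  MPI_Status status;
  MPI_File_write_at_all(fh, 0, buf, count, MPI_BYTE, &status);

  MPI_File_close(&fh);
  MPI_Info_free(&info);
  MPI_Finalize();
}
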
carns commented

Wow, thanks for confirming. If you can share the exact MPI library and version you are using when you fill in more details later, that would be great. This is possibly a vendor bug that should be reported; at worst, the hint should just be unsupported, not faulty.

If you would like, you can also build Darshan with the --with-log-hints="..." configure option so that a different default is compiled in (and the resulting library is safe to use without having to set the environment variable explicitly every time).

Probably related to pmodels/mpich#6408