adjtomo/seisflows

how to incorporate Gian's new mpi system class?


Hi Gian,

I've tested your system.mpi class. It works for me right out of the box, both running on the login node and under SLURM. If I understand correctly, this could be a really significant improvement in terms of reliability and reduced code duplication.

  1. Should we go forward and make pbs_sm, lsf_sm, and slurm_sm inherit from mpi? In each case, all that would be required, I think, is overloading the submit method (see the rough sketch below this list). The current slurm_sm could be renamed slurm_md or something like that. Also, I think there would no longer be a need to distinguish between pbs_sm and pbs_torque_sm.

  2. Since we've both found pbsdsh unreliable, should we remove this way of doing things entirely?
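For point 1, roughly what I have in mind is sketched below. This is only a minimal illustration; the class names, method bodies, and commands are placeholders rather than the actual SeisFlows interfaces.

    import subprocess


    class mpi:
        """Stand-in for the new system.mpi class: runs tasks through mpiexec."""

        def run(self, task, ntask=1):
            # execute a task across MPI processes (placeholder command string)
            subprocess.call('mpiexec -n %d %s' % (ntask, task), shell=True)

        def submit(self, script):
            # on a login node or workstation, just execute the master script
            subprocess.call(script, shell=True)


    class slurm_sm(mpi):
        """Small-cluster SLURM interface: inherit task execution from mpi,
        overload only job submission."""

        def submit(self, script):
            subprocess.call('sbatch %s' % script, shell=True)


    class pbs_sm(mpi):
        """Small-cluster PBS interface: same idea, but submitting with qsub."""

        def submit(self, script):
            subprocess.call('qsub %s' % script, shell=True)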

Ryan

That's good to know.

  1. That sounds like a reasonable way to increase code consistency between platforms and reduce the number of independent system classes. Initially my main concern was that SPECFEM is an inherently MPI-based code, but I see there are options for serial builds.

  2. I think this should supersede pbsdsh. I would still be interested in knowing how robust or failure-prone this could be. I haven't had issues yet, but the MPI standard doesn't always favour calling system/popen commands from within MPI processes. It seems exceptions are allowed, but they are contingent on the user's code.

https://www.open-mpi.org/faq/?category=tuning#fork-warning

I have to admit, I don't fully understand the fork issue. I do remember hearing once that using subprocess.call to invoke an MPI executable from within a Python script (e.g. subprocess.call('mpirun bin/xspecfem3d')) can be problematic, but this seems like a different issue; in practice there has never been any trouble.
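For reference, the kind of call I mean looks something like this (the process count and executable path are just placeholders):

    import subprocess

    # launch an MPI executable from within a Python script; the binary path
    # and process count are placeholders
    subprocess.call('mpirun -n 4 bin/xspecfem3d', shell=True)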

Just to be sure, let me run some longer running tests, and if there are still no problems, I'll go ahead and submit a pull request with the above changes.

I think one issue could be MPI processes performing conflicting commands. For example, if a process attempts to read/write/manipulate a file or a path, it could easily lead to issues if the code isn't designed properly. That said, since seisflows deploys things as independent tasks (in serial), I should think this would be avoided.

The basic design pattern, I think, is called a "work queue". I'm a bit surprised and stumped that something like this would cause trouble... my inexperience with MPI programming is starting to show.
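For context, what I mean by that pattern is roughly the following generic mpi4py illustration; it is not SeisFlows code, and the per-task "work" is just a placeholder print statement.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    ntask = 10  # hypothetical number of independent tasks

    if rank == 0:
        next_task, active = 0, 0
        # seed each worker with one task, or tell it to stop if none remain
        for worker in range(1, size):
            if next_task < ntask:
                comm.send(next_task, dest=worker)
                next_task += 1
                active += 1
            else:
                comm.send(None, dest=worker)
        # hand out remaining tasks as workers report back
        while active > 0:
            status = MPI.Status()
            comm.recv(source=MPI.ANY_SOURCE, status=status)
            if next_task < ntask:
                comm.send(next_task, dest=status.Get_source())
                next_task += 1
            else:
                comm.send(None, dest=status.Get_source())
                active -= 1
    else:
        while True:
            task = comm.recv(source=0)
            if task is None:
                break
            print('rank %d running task %d' % (rank, task))  # placeholder work
            comm.send(task, dest=0)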

Hi Gian, I've submitted a pull request. Could you look it over, and if there are no problems, hit merge? I think you should have the ability to merge pull requests, but I just want to double check.

Ignore the prior comment about the queue deployment; I'm fairly sure it was a memory issue. Looking into some Python-based task queue managers could be useful for load balancing in the future.

Agreed, such managers could be helpful. Not having looked much at what's out there, I'm curious what the balance is between simplicity/usability on the one hand and power/flexibility on the other. Relatively simple solutions would be preferred, I think, at least for the main repository.

Concerning system.mpi, one issue I noticed is that error messages such as

exceeded allocated walltime

are sometimes replaced by more cryptic MPI error messages such as

A daemon (pid unknown) died unexpectedly with status 1 while attempting to launch so we are aborting.

Aside from this, performance seems exactly the same as the old way of doing things in all the tests run so far.

An important issue that might have been overlooked.

Does anyone know whether

mpiexec -n 1 --mca mpi_warn_on_fork 0

works across clusters?

To what extent is the mpiexec command standardized? Does command-line argument syntax differ depending on whether mpiexec comes from OpenMPI or MPICH? Don't some clusters have mpirun or aprun in place of mpiexec?

The environment variable OMPI_COMM_WORLD_RANK definitely seems OpenMPI-specific. Perhaps we should rename system.mpi --> system.openmpi?
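For reference, the environment-variable approach under discussion would look something like the sketch below (not the actual code). OMPI_COMM_WORLD_RANK is OpenMPI-specific; PMI_RANK is included only as a guess at a common MPICH/hydra equivalent and would need verifying.

    import os

    def getnode():
        """Gets number of running task from launcher environment variables"""
        for var in ('OMPI_COMM_WORLD_RANK', 'PMI_RANK'):
            if var in os.environ:
                return int(os.environ[var])
        raise EnvironmentError('could not determine MPI rank from environment')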

I believe mpiexec is part of the MPI standard. It seems mpirun was used in older versions of the standard. Open-mpi uses them as synonyms of one another.

As for the MCA warning, that seems to be specific to Open-mpi; it's unclear whether MPICH issues a similar warning.

An alternative might be to allow users to specify the MPI implementation they're using and then assign appropriate variables/commands from there. That would save having to refactor the inheritance for the small-cluster classes. That said, it depends on whether the goal is to keep the number of different classes minimal or not.

The global variable does not translate to other implementations. Environment variables can be passed to MPICH (mpiexec -env ) and Open-MPI (mpiexec -x =) but the syntax varies. Has there been any consideration of a system-independent way of setting the node id?

An alternative might be to allow users to specify the MPI implementation they're using and then assign appropriate variables/commands from there.

Agreed, this sounds worth considering.
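Something like the following, perhaps. The parameter name, dictionary layout, and helper function are made up for illustration; the per-implementation flags follow the Open-MPI/MPICH syntax mentioned above and would need verifying on each cluster.

    # hypothetical mapping from a user-declared MPI flavor to launcher syntax
    MPI_FLAVORS = {
        'openmpi': {
            'exec': 'mpiexec',
            'warn_off': '--mca mpi_warn_on_fork 0',
            'export': lambda key, val: '-x %s=%s' % (key, val),
        },
        'mpich': {
            'exec': 'mpiexec',
            'warn_off': '',
            'export': lambda key, val: '-env %s %s' % (key, val),
        },
    }

    def launch_command(flavor, ntask, executable):
        cfg = MPI_FLAVORS[flavor]
        return ' '.join(filter(None, [cfg['exec'], '-n %d' % ntask,
                                      cfg['warn_off'], executable]))

    # e.g. launch_command('openmpi', 1, 'bin/xspecfem3d') gives
    # "mpiexec -n 1 --mca mpi_warn_on_fork 0 bin/xspecfem3d"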

Environment variables can be passed to MPICH (mpiexec -env ) and Open-MPI (mpiexec -x =) but the syntax varies. 

I guess it comes down to: is there some way to pass environment variable values in such a way that each MPI process receives a different value? It seems like, for the -env and -x flags, all MPI processes receive the same passed value.

Rather than checking the OpenMPI environment variable, couldn't we just use mpi4py's Get_rank function? In other words,

    def getnode(self):
        """Gets number of running task"""
        from mpi4py import MPI
        return MPI.COMM_WORLD.Get_rank()

This is a system-independent approach, right? Running under SLURM, it seems to work for me.
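As a quick check (the script name here is made up), running something like mpiexec -n 4 python check_rank.py should print a distinct rank from each process, regardless of which implementation mpiexec comes from:

    # check_rank.py
    from mpi4py import MPI
    print(MPI.COMM_WORLD.Get_rank())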

Even if this does solve the problem for getnode, we still may have to think about how to invoke wrapper/run_mpi.

Yes, that change should be fine and solve that problem.

I'm not familiar with how launching with aprun compares to mpiexec, so I can't comment there. For MPICH/Open-MPI the only difference seems to be the fork warning, which could be suppressed with a parameter option. Testing the class on the various implementations is something I haven't done yet but needs to be done.

The new class seems pretty well integrated now, so if it's alright, I'll go ahead and close this issue. If problems are discovered later, please feel free to open a new one.