adjtomo/seisflows

broken PBS system classes

rmodrak opened this issue · 6 comments

James and I have been working on these classes. At the moment they are not working; in fact, the master branch still contains a lot of hardwired variables and debug statements. We hope to address this when James arrives back.

Comments from @gianmatharu :

One thing I did notice in the pbs system class was that system.run invoked srun and not pbsdsh. Is this correct? I thought srun was a SLURM command, or are you running on a cluster with multiple scheduling options?

I adjusted the pbsdsh version and managed to run it on a PBS cluster. One problem I did encounter was that pbsdsh spawns minimalistic environments that can cause issues if you're relying on certain environment variables.

responses to @gianmatharu comments:

One thing I did notice in the pbs system class was that system.run invoked srun and not pbsdsh. Is this correct? I thought srun was a SLURM command, or are you running on a cluster with multiple scheduling options?

That is definitely a mistake. Currently, another grad student, James Smith, is working on fixing the PBS interfaces.

One problem I did encounter was that pbsdsh spawns minimalistic environments that can cause issues if you're relying on certain environment variables.

This is an issue I remember having with PBS. Depending on which version you're using, you may be able to export environment variables through pbsdsh arguments or, in the worst case, through command line arguments to the executable being run through pbsdsh.
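
As a rough illustration of the second option, here is a minimal Python sketch that forwards selected variables on the command line by placing the standard `env` utility between pbsdsh and the executable. The function name `build_pbsdsh_command` and the `/usr/bin/env` path are assumptions made for this example, not part of seisflows, and pbsdsh option handling differs between Torque and PBS Pro, so check the local man page before relying on it.

```python
import os


def build_pbsdsh_command(executable, args=(), env_names=(), environ=None):
    """Return an argv list that runs `executable` under pbsdsh.

    Hypothetical helper: the selected variables are passed as NAME=value
    arguments to `env`, so they reach the task even though pbsdsh starts
    it in a nearly empty environment.
    """
    environ = os.environ if environ is None else environ
    # Forward only the variables that are actually set in the source environment.
    assignments = [
        f"{name}={environ[name]}" for name in env_names if name in environ
    ]
    # Assumption: `env` lives at /usr/bin/env on the compute nodes.
    return ["pbsdsh", "/usr/bin/env", *assignments, executable, *args]
```

The returned list can be handed to `subprocess.call` from inside a PBS job, for example `build_pbsdsh_command("python", ["run_task.py"], env_names=["PATH", "PYTHONPATH"])`, where `run_task.py` is a placeholder script name.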

I adjusted the pbsdsh version and managed to run it on a PBS cluster. 

Excellent. So an entire model update iteration runs successfully?

As a quick fix I resorted to a shell script that exports the variables and
then calls the wrappers. I may change this at some point.

I'm still testing it. I had some crashes due to race conditions causing
conflicts, which I've since fixed. In certain situations I've run into
peculiar error messages which I believe arise from pbsdsh and the MPI
calls; I need to investigate this further. Once I have things working 100%
I'll update you, which should be sometime this weekend.

Update: I have adjusted the pbs_sm class and have a working system class that successfully performs updates. In the current implementation pbsdsh calls a shell script to export certain variables to the pbsdsh environment. I could send a copy for you to look over prior to submitting a pull request.
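
For readers who want to try the same workaround, the following is a minimal sketch of how such a wrapper script could be generated from Python. It is not the script used in the pbs_sm class; the function name `write_env_wrapper`, the use of `/bin/sh`, and the choice of variables are all assumptions for illustration.

```python
import os
import shlex
import stat


def write_env_wrapper(path, env_names, environ=None):
    """Write a shell script that exports variables, then runs its arguments.

    Hypothetical helper: values are captured when the script is written
    (for example on the submitting node) and restored inside the minimal
    environment that pbsdsh provides.
    """
    environ = os.environ if environ is None else environ
    lines = ["#!/bin/sh"]
    for name in env_names:
        if name in environ:
            # shlex.quote keeps spaces and quote characters in the value intact.
            lines.append(f"export {name}={shlex.quote(environ[name])}")
    # Replace the shell with the wrapped command, e.g. a seisflows wrapper.
    lines.append('exec "$@"')
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    # Mark the script as executable so it can be invoked directly.
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path
```

The generated script would then be placed in front of the real command, as in `pbsdsh /path/to/wrapper.sh python run_task.py`, where both paths are placeholders.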

Hi Gian, feel free to submit a pull request right away. Could you please
name the module 'torque_sm', or perhaps 'pbs_torque_sm', whichever you
prefer? On our end we are working with PBS Pro. Thanks, Ryan

From James's tests, it seems the pbs_lg class is working, and I believe the same is true for Gian and the pbs_sm class. So, if that's alright, I'll go ahead and close this issue.