idia-astro/pipelines

Single node pipeline processing

manuparra opened this issue · 2 comments

Hi all, I've been reviewing the code and I see that the pipeline scripts are tightly coupled to execution under SLURM (which makes sense, since the pipeline targets that kind of cluster). We are running some tests to execute the pipeline without SLURM, directly using Singularity, but there are references to SLURM variables and procedures throughout the code, which makes this complicated.

Do you think it could be ported to a model without SLURM?

Hi @manuparra, supporting other non-SLURM platforms is generally on the roadmap, although I must admit that single-node/VM support isn't a strong part of that, since multi-node processing is a core design goal of the pipeline. However, I believe one can use MPI on a single node/VM anyway, assuming there are enough cores to make it worthwhile. And a few tasks, such as tclean, can also make use of multiple cores through OpenMP.

So, with a few tweaks, I have used the pipeline successfully on a single VM. However, that was before version 1.1, in which we introduced SPW splitting.

To do this, one can take the sbatch scripts the pipeline writes and simply run them as bash scripts, since the #SBATCH lines are commented out. The tweaks include removing the SLURM srun wrapper, which is easily done within the code, or otherwise making srun a script/alias or something understood by the system (a minimal shim is sketched just after the example below). Another tweak is to write your own submit_pipeline.sh script where, instead of using SLURM dependencies, you use the bash && operator, which achieves a similar effect by running each job after the previous one finishes successfully, or stopping if a job crashes. So, for example, if you first make all the sbatch scripts executable (e.g. with chmod +x validate_input.sbatch), using the default scripts in the scripts config parameter, you could write something like this:

./partition.sbatch && ./validate_input.sbatch && ./flag_round_1.sbatch && ./calc_refant.sbatch && ./setjy.sbatch && ./xx_yy_solve.sbatch && ./xx_yy_apply.sbatch && ./flag_round_2.sbatch && ./xx_yy_solve.sbatch && ./xx_yy_apply.sbatch && ./split.sbatch && ./quick_tclean.sbatch
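
On the srun tweak mentioned above, a minimal pass-through shim might look like the following. This is only a sketch, not anything shipped with the pipeline, and it naively assumes the generated scripts pass srun options in --option=value form; adjust it if option values appear as separate arguments.

mkdir -p ~/bin
cat > ~/bin/srun << 'EOF'
#!/bin/bash
# Pass-through srun replacement: drop any leading srun-style options
# (assumed to be in --option=value form) and run the wrapped command directly.
while [[ $# -gt 0 && "$1" == -* ]]; do shift; done
exec "$@"
EOF
chmod +x ~/bin/srun
export PATH="$HOME/bin:$PATH"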

And you might also want to look at redirecting the output to a log within your sbatch scripts, with something like this:

1> logs/validate_input.out 2> logs/validate_input.err
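
For context, here is a hypothetical sketch of what a command inside one of the sbatch scripts might end up looking like once the srun wrapper is removed and the redirect is appended; the container path and script name are placeholders rather than the pipeline's actual contents.

mkdir -p logs
# Placeholder command: substitute whatever the sbatch script already runs.
singularity exec /path/to/casa-container.simg python validate_input.py 1> logs/validate_input.out 2> logs/validate_input.err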

However, doing this with SPW splitting (where nspw > 1) would require a bit more thought. Writing a script to launch the custom submit_pipeline.sh scripts inside each SPW directory wouldn't be difficult, but the trick after that would be automatically running the post-cross-calibration scripts, such as concatenation and further selfcal and science imaging. It would be easy to split that off into a separate step following the previous example; it would just require further intervention by the user after the first cross-calibration steps have run over all SPWs.
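
As a rough illustration of the per-SPW launcher, something like the following could work; the spw*/ glob is a placeholder for whatever directory naming your nspw > 1 run produces.

#!/bin/bash
# Run the custom submit_pipeline.sh inside each SPW directory in turn,
# recording failures rather than stopping the whole run.
for spw_dir in spw*/ ; do
    ( cd "$spw_dir" && ./submit_pipeline.sh ) || echo "${spw_dir%/}" >> failed_spws.txt
done
# The post-cross-calibration steps (concat, selfcal, science imaging) would then be
# run manually, or by a follow-up script, once the per-SPW results have been checked.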

Another trick that might be useful is the -l (--local) option, which bypasses SLURM/srun and builds the pipeline without it.
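
For example, something along these lines; only -l/--local comes from the point above, while the script name and other flags here are from memory and should be checked against the pipeline's --help output.

# Hypothetical build invocation with SLURM/srun bypassed via the local option:
./processMeerKAT.py -B -C myconfig.txt -M /path/to/mydata.ms -l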

Are you thinking of doing some of this development yourself? What's the platform and software you're using? I'd be happy to walk you through doing some of these things if it's useful.

Hi @manuparra, I suppose you could consider a high-level script that runs the custom submit_pipeline.sh scripts inside the SPW directories, and then uses the && and || operators to run the final imaging steps in a similar way (|| being useful when you wish to run concat, selfcal and science imaging even if some SPWs failed).
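
A rough sketch of that idea, under the same assumptions as before: the spw*/ glob and the names of the post-cross-calibration sbatch scripts are placeholders for whatever your config generates.

#!/bin/bash
# Run every SPW, continuing past failures, then run the post-cross-calibration
# steps as long as at least one SPW succeeded.
ok=0
for spw_dir in spw*/ ; do
    ( cd "$spw_dir" && ./submit_pipeline.sh ) && ok=1 || echo "${spw_dir%/} failed; continuing"
done
if [[ $ok -eq 1 ]]; then
    # Placeholder names for the concat / selfcal / science imaging scripts:
    ./concat.sbatch && ./selfcal.sbatch && ./science_image.sbatch
else
    echo "All SPWs failed; skipping concat/selfcal/imaging" >&2
    exit 1
fi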

Of course, beyond that, you could abandon the bash approach altogether and use Python or something else, but that may require quite a bit of development.