idia-astro/pipelines

Issue with package modules and container


Hi @Jordatious,
I'm working on adapting the pipeline to our cluster. I have made the modifications and it seems to create everything correctly, submit the jobs, etc. But when I run ./submit_pipeline.sh it terminates too early, so checking the errors I see the following:

./findErrors.sh

SPW #1: /mnt/slurm-jobs/pipelines/1382~1383MHz
logs/validate_input-1591.casa  logs/validate_input-1591.err  logs/validate_input-1591.mpi  logs/validate_input-1591.out
logs/flag_round_1-1592.err  logs/flag_round_1-1592.out
ModuleNotFoundError: No module named 'config_parser'
ModuleNotFoundError: No module named 'config_parser'
ModuleNotFoundError: No module named 'config_parser'
ModuleNotFoundError: No module named 'config_parser'
ModuleNotFoundError: No module named 'config_parser'
ModuleNotFoundError: No module named 'config_parser'
ModuleNotFoundError: No module named 'config_parser'
ModuleNotFoundError: No module named 'config_parser'
ModuleNotFoundError: No module named 'config_parser'
ModuleNotFoundError: No module named 'config_parser'
logs/*1593* logs don't exist (yet)
logs/*1594* logs don't exist (yet)
...
SPW #2: /mnt/slurm-jobs/pipelines/1383~1384MHz
logs/validate_input-1602.casa  logs/validate_input-1602.err  logs/validate_input-1602.mpi  logs/validate_input-1602.out
logs/flag_round_1-1603.err  logs/flag_round_1-1603.out
ModuleNotFoundError: No module named 'config_parser'
ModuleNotFoundError: No module named 'config_parser'
FATAL:   container creation failed: mount /proc/self/fd/3->/usr/local/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: failed to find loop device: unexpected EOF
ModuleNotFoundError: No module named 'config_parser'
...

Here you can see ModuleNotFoundError: No module named 'config_parser', so at this point it is calling this script, but it is not being imported from processMeerKAT.py, or something is wrong with the Python paths.

Then you can see FATAL: container creation failed: mount /proc/self/fd/3->/usr/local/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/3: failed to find loop device: unexpected EOF. It does not appear in all the SPWs, so I don't know what this error is about.

More details (and files generated):

  • Running echo $PATH I get: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/mnt/slurm-jobs/pipelines/processMeerKAT

  • A piece of code from the submit_pipeline.sh:

#!/bin/bash

#partition.sbatch
allSPWIDs=$(sbatch partition.sbatch | cut -d ' ' -f4)
echo Running partition job array, iterating over 11 SPWs.

partitionID=$(echo $allSPWIDs | cut -d , -f1)

#Add time as extn to this pipeline run, to give unique filenames
DATE=2024-01-25-13-21-07
mkdir -p jobScripts
mkdir -p logs

echo Running pipeline in directory "1382~1383MHz" for spectral window *:1382~1383MHz
cd 1382~1383MHz
output=$(processMeerKAT.py --config ./imaging-line-config.txt --run --submit --quiet --justrun --dependencies=$partitionID\_0)
echo -e $output
...
  • partition.sbatch
#!/bin/bash
#SBATCH --array=0-10%8
#SBATCH --account=ubuntu
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH --mem=24GB
#SBATCH --job-name=partition
#SBATCH --distribution=plane=1
#SBATCH --output=logs/%x-%A_%a.out
#SBATCH --error=logs/%x-%A_%a.err
#SBATCH --partition=debug
#SBATCH --time=12:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
module load openmpi/4.0.3


#Iterate over SPWs in job array, launching one after the other
SPWs="1382~1383MHz 1383~1384MHz 1384~1385MHz 1385~1386MHz 1387~1388MHz 1388~1389MHz 1389~1390MHz 1391~1392MHz 1392~1393MHz 1393~1394MHz 1395~1396MHz"
arr=($SPWs)
cd ${arr[SLURM_ARRAY_TASK_ID]}

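# Run partition.py for this SPW inside the CASA container, on every MPI rank launched by mpirun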
mpirun singularity exec  --bind /mnt:/mnt  /mnt/software/containers/casa-6.5.0-modular.sif  python ./processMeerKAT/crosscal_scripts/partition.py --config .config.tmp
cd ..

Any ideas?

It might be worth trying to use the adaptation that I made in my fork that has been tested on a SLURM cluster in Manchester and Petrichor in Australia. The main difference is a config file for your local cluster. I haven't worked on the pipeline for a while though, and I think even this fork will become (or has become) stale. I don't recognise this error from my work with the pipeline.

Setting up a new cluster should be relatively easy if you follow the instructions in the readme.

Hi @mb010, I fixed the problem yesterday. It was due to the Python path and where I was calling the pipeline from. And yes, I've been playing with your fork and I was able to run it; I had to fix some things in the imaging, but it seems to work correctly, although I also had some Python path problems there that are now fixed. When I compared the original code with yours I saw some differences, and for that reason I wanted to test the original code with the latest updates and make the modifications for our cluster parameters. Now it seems to be going well; at least all the jobs have completed and there are no errors, but we are still verifying that the imaging results are OK.
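In case it helps anyone hitting the same ModuleNotFoundError: the fix amounted to making sure the processMeerKAT directory is on PYTHONPATH (not only PATH) in the environment the jobs run in, and launching the pipeline from the right directory. A minimal sketch of the kind of exports involved, assuming the install location from my setup (the SINGULARITYENV_ line is only needed if the container does not inherit the host environment):

# Make the pipeline scripts (processMeerKAT.py, config_parser.py, ...) findable and importable.
# The path below is my install location; adjust to yours.
export PATH=$PATH:/mnt/slurm-jobs/pipelines/processMeerKAT
export PYTHONPATH=$PYTHONPATH:/mnt/slurm-jobs/pipelines/processMeerKAT

# If singularity is run with a clean environment, pass PYTHONPATH through explicitly
# so that 'import config_parser' also resolves inside the container.
export SINGULARITYENV_PYTHONPATH=$PYTHONPATH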
Another question, for a scientist like you: are the FITS files created inside each SPW? I'm sorry, I'm an engineer before I'm a scientist :)
Thanks in advance.

Glad to hear that it's fixed.

Re the fork: Great. If you made any changes that generalise or fix anything, please open a PR somewhere so that even if this fork becomes stale, we can see the most recent improvements across forks. 🙂 I would like to work with people sometime this year to make sure that, if it does go stale, we at least have the most recent version public somewhere.

I'm not exactly sure what you mean by the FITS files in the SPW. The SPWs in this pipeline are determined computationally to parallelise the calibration across nodes. In practice, final science images will either be channelised (usually with channel widths smaller than each of the SPWs in this pipeline) or MFS images, at which point the final re-merged MMS resulting from this pipeline, containing the calibrated visibilities, should be imaged.

As a heads up: if I remember correctly, the imaging scripts do not produce final "science images". I wrote my own scripts, which you can find here.
The images that the IDIA pipeline produces are (to my understanding) test images to make sure the calibration didn't break.

Thanks @mb010, I'm happy to collaborate on keeping everything up to date using your approach.
As for the images, yes, that's what I meant: the science image. So does this science_image.sbatch file refer to what you describe, i.e. that the pipeline's images are a calibration check?

I've been looking at your code now to see how it integrates with the pipeline results I already have, that is, how to adapt it to my results in order to run your code. This is my pipeline structure right now after the last run:
[screenshot: pipeline directory structure after the last run]

I see in the code https://github.com/mb010/MIGHTEE-POL_imaging/blob/main/split/merge_spw.sh that it needs a .mms file, like export VIS=/share/nas2/MIGHTEE/calibration/cosmos/1587911796_sdp_l0/1587911796_sdp_l0.4k.J1000+0212.mms #COSMOS test, but after running the pipeline I have imaging files within the SPWs, in a subfolder named `images`:

[screenshot: images subfolder inside an SPW directory]

And in this folder I have the fits files:
[screenshot: FITS files in the images folder]

So, basically, I'm at this point, but reading this: https://idia-pipelines.github.io/docs/processMeerKAT/science-imaging-in-processmeerkat/ I'm not sure whether this pipeline creates the science imaging. What do you think? Thanks in advance.

Yeah, this pipeline won't make science images. The selection of imaging parameters is beyond the scope of this pipeline and is dependent on the science case.

This line in the README is, I think, what you might be missing conceptually. If I remember correctly, the MMS which is initially used to construct the SPW folders will contain the calibrated data at the end of the pipeline run. (If this isn't true, I'll have to have a think, but at least that's how I remember it working.) The calibrated visibilities should be contained within that MS / MMS, and those should be used for imaging / science. The SPWs are then "nuisance" data and can be removed if you are confident (or stored, if you like hoarding intermediaries like I do). Of course, if you only want to image one of the SPWs as already split, then you can use the MMS within one of the SPW folders.

All of the FITS files / imaging artefacts in the SPW folders are for validation (they may even come from a deprecated quick-imaging script within the pipeline).

I have to my knowledge never used science_image.py (or the respective slurm script). It looks to be configurable, but I would probably suggest imaging separately depending on the science case.
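Just as an illustration rather than a recommendation, imaging the calibrated, re-merged MMS separately could look roughly like the sketch below, reusing the same container you already use in partition.sbatch. The MMS path, image size, cell size, weighting and niter are placeholders you would choose for your science case:

# Hypothetical example only: image the final calibrated MMS with tclean outside the pipeline.
singularity exec --bind /mnt:/mnt /mnt/software/containers/casa-6.5.0-modular.sif \
  python -c "
from casatasks import tclean
tclean(vis='/mnt/slurm-jobs/pipelines/mytarget.mms',  # placeholder path to the calibrated MMS
       imagename='science/mytarget_cube',
       specmode='cube',            # channelised imaging; use 'mfs' for a continuum image
       deconvolver='hogbom',
       imsize=[4096, 4096], cell='1.5arcsec',
       weighting='briggs', robust=0.0,
       niter=10000)
"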

Great! Well, let me now run your fork and branch: https://github.com/mb010/pipelines/tree/HPC_parameter (or the main branch?)
And I will let you know, just to test if it works.

Yes, this was the issue that I found with the fork (branch HPC_Parameter):
First I ran processMeerKAT.py -vv -B -C imaging-line-config.txt -M /mnt/slurm-jobs/data-reduction-hcg97-av16sch4.ms -2 -I and then:

processMeerKAT.py -R -C imaging-line-config.txt
2024-01-26 12:31:25,490 INFO: Found HPC in config file: unknown
2024-01-26 12:31:25,490 INFO: unknown
2024-01-26 12:31:25,490 INFO: Setting option 'nodes' to '5096' from HPC 'unknown' config.
2024-01-26 12:31:25,490 INFO: Setting option 'ntasks_per_node' to '1024' from HPC 'unknown' config.
2024-01-26 12:31:25,490 INFO: Setting option 'mem' to '5000' from HPC 'unknown' config.
2024-01-26 12:31:25,490 INFO: Setting option 'partition' to 'debug' from HPC 'unknown' config.
2024-01-26 12:31:25,490 INFO: Setting option 'scripts' to '[('validate_input.py', False, ''), ('flag_round_1.py', True, ''), ('calc_refant.py', False, ''), ('setjy.py', True, ''), ('xx_yy_solve.py', False, ''), ('xx_yy_apply.py', True, ''), ('flag_round_2.py', True, ''), ('xx_yy_solve.py', False, ''), ('xx_yy_apply.py', True, ''), ('split.py', True, ''), ('quick_tclean.py', True, '')]' from HPC 'unknown' config.
2024-01-26 12:31:25,490 INFO: Setting option 'precal_scripts' to '[('calc_refant.py', False, ''), ('partition.py', True, '')]' from HPC 'unknown' config.
2024-01-26 12:31:25,490 INFO: Setting option 'postcal_scripts' to '[('concat.py', False, ''), ('plotcal_spw.py', False, ''), ('selfcal_part1.py', True, ''), ('selfcal_part2.py', False, ''), ('science_image.py', True, '')]' from HPC 'unknown' config.
2024-01-26 12:31:25,490 INFO: Setting option 'modules' to '['openmpi/2.1.1']' from HPC 'unknown' config.
2024-01-26 12:31:25,490 INFO: Setting option 'mpi_wrapper' to 'mpirun' from HPC 'unknown' config.
2024-01-26 12:31:25,490 INFO: Setting option 'container' to '/mnt/software/containers/casa-6.5.0-modular.sif' from HPC 'unknown' config.
2024-01-26 12:31:25,490 INFO: Setting option 'account' to '' from HPC 'unknown' config.
2024-01-26 12:31:25,492 WARNING: HPC facility is not in 'known_hpc.cfg', reverting to 'unknown' HPC. You input 0. Pipeline will rely entirely on the specified arguemnts. No upper limits will be set. HPC specific selections within your config may cause pipeline runs to fail!
2024-01-26 12:31:25,501 WARNING: Unknown keys ['outlier_radius'] present in section [selfcal] in 'imaging-line-config.txt'.
2024-01-26 12:31:25,505 WARNING: Unknown keys ['pbband', 'pbthreshold', 'outlierfile'] present in section [image] in 'imaging-line-config.txt'.
Traceback (most recent call last):
  File "/mnt/slurm-jobs/pipelines-HPC_parameter/processMeerKAT/processMeerKAT.py", line 1663, in <module>
    main()
  File "/mnt/slurm-jobs/pipelines-HPC_parameter/processMeerKAT/processMeerKAT.py", line 1659, in main
    kwargs = format_args(args.config, args.submit, args.quiet, args.dependencies, args.justrun)
  File "/mnt/slurm-jobs/pipelines-HPC_parameter/processMeerKAT/processMeerKAT.py", line 1325, in format_args
    imaging_kwargs = get_config_kwargs(config, 'image', HPC_DEFAULTS['IMAGING_CONFIG_KEYS'.lower()])
  File "/mnt/slurm-jobs/pipelines-HPC_parameter/processMeerKAT/processMeerKAT.py", line 1619, in get_config_kwargs
    raise KeyError("Keys {0} missing from section [{1}] in '{2}'. Please add these keywords to '{2}', or else run [-B --build] step again.".format(missing_keys,section,config))
KeyError: "Keys ['specmode'] missing from section [image] in 'imaging-line-config.txt'. Please add these keywords to 'imaging-line-config.txt', or else run [-B --build] step again."

I solved it by adding specmode (specmode = 'cube') to the [image] section of the config file. I don't know if it is correct to bypass the issue this way.
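For reference, the relevant part of the [image] section in imaging-line-config.txt now contains something like the following (all other keys left as the build step generated them):

[image]
specmode = 'cube'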

Yeah. I think that's fine. 👍 I left it in a branch for the PR which has been stale for a while, but maybe we can move everything into one repo eventually.

Thank you @mb010! It works partially. Now I have encountered other errors in the concat phase (concat.sbatch). I'll put them in another issue and close this one. Thank you very much, and keep in touch. Regards.

@mb010 Could you review this issue that I've found? #62
Regards,
Manu.