CARACal INFO: exiting with error code 1
Hi,
I'm trying to run CARACal on a Slurm system that is not ilifu. As far as I know it's "correctly" installed (although I did not do the install myself). Nonetheless, something is clearly wrong, because I get the following error and I don't know what to do with it. The `.stimela_workdir-16938348029524467` directory is present at some point while caracal is running, but it has disappeared by the time the run crashes out with this error message. I tried the `export SINGULARITY_PULLFOLDER=/scratch/users/putyourusernamehere/STIMELA_IMAGES_NEW` recommendation from issue #1087, but it made no difference.
2023-09-04 15:45:20 CARACal.Stimela.summary_json-ms0-0 ERROR: cd /cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/1_HI_caracal/1613847072/.stimela_workdir-16938348029524467 && singularity run --userns --workdir /cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/1_HI_caracal/1613847072/.stimela_workdir-16938348029524467 --containall returns error code 1
2023-09-04 15:45:20 CARACal.Stimela.summary_json-ms0-0 ERROR: job failed at 2023-09-04 15:45:20.739173 after 0:00:28.085690
2023-09-04 15:45:21 CARACal ERROR: Job 'summary_json-ms0-0:: Get observation information as a json file ms=1613847072_sdp_l0_HI-cal.ms' failed: cd /cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/1_HI_caracal/1613847072/.stimela_workdir-16938348029524467 && singularity run --userns --workdir /cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/1_HI_caracal/1613847072/.stimela_workdir-16938348029524467 --containall returns error code 1 [PipelineException]
2023-09-04 15:45:21 CARACal INFO: More information can be found in the logfile at output_1613847072/logs-20230904-153954/log-caracal.txt
2023-09-04 15:45:21 CARACal INFO: exiting with error code 1
Thanks in advance for your help.
Just a bit more info about the issue we're facing (I did the install, broadly following what's on GitHub). This is a system that runs Apptainer; all Stimela images live in a directory pointed to via ${CARACAL_IMAGES}. The pipeline runs in an environment with Python 3.9.6 and is invoked as
caracal -c config.yml -ct singularity -sid ${CARACAL_IMAGES}
As far as I can tell, the singularity images are found and individual tasks run but the return code from apptainer is (mis-?) interpreted as an error. If one re-runs the same caracal command again, it finds the previously generated output, skips that step and moves on to the next task. This also runs to completion but seems to return an error. In the attached log files, one can see that it takes three caracal runs for the pipeline to finish. The output of listobs, summary_json-ms0, and elevation-plots-ms0 all look fine -- despite the 'exit code 1' messages.
I found a vaguely related issue #1361 but in our case it's not the ctrl-c'ed singularity images.
The config looks like this:
```yaml
schema_version: 1.0.4

general:
  title: ''
  rawdatadir: '/cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/0_HI_raw'
  msdir: msdir
  input: input
  output: output
  prefix: '1613847072' # !!! rename for each data set
  final_report: false

getdata:
  dataid: ['1613847072_sdp_l0_HI']

obsconf:
  obsinfo:
    enable: true
  target:
    - all
  fcal:
    - 'longest'
  bpcal:
    - 'longest'
  gcal:
    - 'all'
  xcal:
    - 'longest'
  refant: 'm032'
```
logs-20230904-171450.txt
logs-20230904-172054.txt
logs-20230904-171822.txt
Have you tried with CARACal's latest stable release?
That's v1.0.7, is it? -- not yet.
Yep, that's right -- with an unfortunate error here (line 30 in commit 84299e3).
It's actually not entirely clear to me which version we're running: pip tells me it's 1.1.1, while `caracal --version` tells me it's 1.0.6, as also seen in the logs.
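One way to narrow down a mismatch like this is to compare, from inside the same virtual environment, the version recorded in the installed package metadata (what pip reports) with the version string the imported module carries. The sketch below is only an illustration: `report_versions` is a hypothetical helper, it assumes the package exposes a `__version__` attribute (not verified here for caracal or stimela), and `importlib.metadata` needs Python 3.8 or newer.

```python
import importlib
from importlib import metadata


def report_versions(dist_name, module_name=None):
    """Return (version from installed metadata, module.__version__ or None)."""
    module_name = module_name or dist_name
    try:
        # This is the version recorded at install time, i.e. what pip shows.
        dist_version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        dist_version = None
    try:
        # Assumption: the package defines __version__; many do, not all.
        module = importlib.import_module(module_name)
        module_version = getattr(module, "__version__", None)
    except ImportError:
        module_version = None
    return dist_version, module_version


if __name__ == "__main__":
    for name in ("caracal", "stimela"):
        print(name, report_versions(name))
```

If the two values differ, the version string hard-coded in the package is a likely suspect, rather than two different installs being picked up.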
Hmm... I'll let others comment on this; I'm not the best in the team with software versions, pip, etc. (as the above error shows)
No joy with v1.0.7 -- exactly the same issue.
logs-20230905-103639.txt
logs-20230905-103933.txt
logs-20230905-104124.txt
Which Stimela version are you running? I would suggest making one last try with Stimela 1.7.6, which is the stable release I use successfully with CARACal 1.0.7.
Sorry for the pain ...
I was running 1.7.8, downgraded to 1.7.6. No joy :-( Same issue.
Is it possible to get more info from the singularity images than just 'finished with exit code 1'? The temporary work directory `.stimela_workdir-<random_number_string>` disappears right after the pipeline has run, so one cannot re-run the command that threw the error by hand. Using the `--debug` flag hasn't really helped much yet either.
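Since the work directory vanishes once the run ends, one option is to copy it elsewhere while the pipeline is still running (for example from a second shell). The following is a minimal sketch, not part of CARACal or Stimela: `snapshot_workdirs` is a hypothetical helper, the glob pattern is taken from the directory names seen in the logs, and a copy taken while the job is still writing files may be incomplete or fail partway.

```python
import shutil
from pathlib import Path


def snapshot_workdirs(run_dir, dest_dir, pattern=".stimela_workdir-*"):
    """Copy every matching work directory under run_dir into dest_dir.

    Returns the list of destination paths that were written.
    """
    copied = []
    for workdir in sorted(Path(run_dir).glob(pattern)):
        if workdir.is_dir():
            target = Path(dest_dir) / workdir.name
            # dirs_exist_ok (Python 3.8+) allows repeated snapshots.
            shutil.copytree(workdir, target, dirs_exist_ok=True)
            copied.append(target)
    return copied
```

Calling this in a loop with a short `time.sleep` between calls would catch directories that only exist for a few seconds.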
I have now reverse-engineered what caracal is doing in terms of running singularity images.
I ran caracal again with `-debug` enabled, as I realized that this shows the actual singularity command being run. Once the script drops into `pdb`, I copy/rsync the content of `.stimela_workdir-<random-number>` somewhere else before exiting pdb. I then copy that directory back to where it was, under the same name. Next I define the environment variables that are reported as set (e.g. `export SINGULARITYENV_STIMELA_MOUNT=/stimela_mount; export SINGULARITYENV_OUTPUT=${SINGULARITYENV_STIMELA_MOUNT}/output` and so on). Finally I rerun the singularity command as shown by the pipeline, i.e.
cd /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/.stimela_workdir-16939197495247493 && singularity run --workdir /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/.stimela_workdir-16939197495247493 --containall --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/.stimela_workdir-16939197495247493/stimela_parameter_files/elevation_plots_ms0-2304369160654416939197497773852.json:/stimela_mount/configfile:ro --bind /cephyr/NOBACKUP/groups/hess/franz/software/caracal-1.0.7/lib/python3.9/site-packages/stimela/cargo/cab/owlcat_plotelev/src:/stimela_mount/code:ro --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/.stimela_workdir-16939197495247493/passwd:/etc/passwd:rw --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/.stimela_workdir-16939197495247493/group:/etc/group:rw --bind /cephyr/NOBACKUP/groups/hess/franz/software/caracal-1.0.7/bin/stimela_runscript:/singularity:ro --bind /cephyr/NOBACKUP/groups/hess/antlia/hi/HI_2023/0_HI_raw:/stimela_mount/msdir:rw --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/input:/stimela_mount/input:ro --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/msdir:/stimela_mount/output:rw --bind /cephyr/NOBACKUP/groups/hess/franz/caracal-tests/HI_2023_1_HI_caracal_1613847072/msdir/tmp:/stimela_mount/output/tmp:rw /cephyr/NOBACKUP/groups/hess/franz/software/caracal-images2/stimela_owlcat_1.6.6.img /singularity
As expected, this finishes without errors and I get a nice new elevation plot.
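For anyone wanting to repeat this kind of manual re-run and see exactly what the container returns, a small wrapper that captures the exit code and stderr can help. This is a sketch under assumptions: `rerun` is a hypothetical helper, not how Stimela itself launches containers. The `cd ... &&` prefix from the logged command is replaced by the `cwd` argument, because the command is run without a shell.

```python
import os
import shlex
import subprocess


def rerun(command, extra_env=None, cwd=None):
    """Run a command string without a shell; return (exit code, stdout, stderr)."""
    env = dict(os.environ)
    env.update(extra_env or {})  # e.g. the SINGULARITYENV_* variables
    result = subprocess.run(
        shlex.split(command),
        env=env,
        cwd=cwd,
        capture_output=True,
        text=True,
        check=False,  # we want to inspect failures, not raise on them
    )
    return result.returncode, result.stdout, result.stderr


# Example call (placeholders, to be filled in from the debug log):
# code, out, err = rerun(
#     "singularity run --workdir <workdir> --containall ... <image> /singularity",
#     extra_env={
#         "SINGULARITYENV_STIMELA_MOUNT": "/stimela_mount",
#         "SINGULARITYENV_OUTPUT": "/stimela_mount/output",
#     },
#     cwd="<workdir>",
# )
```

Comparing the exit code seen here with the one the pipeline reports would show whether the non-zero code comes from the container or from the environment the pipeline launches it in.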
IMHO this has nothing to do with singularity or apptainer, but I have a hard time figuring out where else to look now. I am attaching the latest log that I have with debugging enabled: [log-caracal.txt](https://github.com/caracal-pipeline/caracal/files/12525157/log-caracal.txt)
@SpheMakh I think we need your input here
We might have solved the issue. Until now, I first loaded the Python 3.9.6 module via `module load python...` and then created the virtual environment. This puts a bunch of things into `$LD_LIBRARY_PATH` and `${PATH}`.
If I do not load the module but instead use the system Python (3.6.8) to create the virtual env and then build caracal (v1.0.7, stimela version 1.7.9), everything runs just fine. I am attaching here what's in `env` for the case when loading Python as a module and when using the system Python. Maybe that helps to find the problem.
module_load_python_3.9.6_env.txt
system_python_3.6.8_env.txt
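Two `env` dumps like these can be compared mechanically. The sketch below assumes each file holds plain `NAME=value` lines; multi-line values (for instance shell functions exported by the module system) are not handled. All function names are hypothetical helpers.

```python
def parse_env(text):
    """Parse `env` output (one NAME=value per line) into a dict."""
    variables = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            variables[name] = value
    return variables


def diff_env(a, b):
    """Return {name: (value_in_a, value_in_b)} for variables that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


def path_entries_only_in(a_value, b_value):
    """List the colon-separated entries present in a_value but not in b_value."""
    b_entries = set((b_value or "").split(":"))
    return [p for p in (a_value or "").split(":") if p and p not in b_entries]
```

With the two attachments read into strings, `diff_env(parse_env(module_text), parse_env(system_text))` lists every variable that differs, and `path_entries_only_in` narrows `PATH` or `LD_LIBRARY_PATH` down to the entries added by the module.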
Haha, fun fact: it would appear somebody updated stimela to version 1.7.9 since yesterday. This comes with a bunch of new singularity images. That merely coincided with my system-python test and led me to false conclusions, as the system-python build of caracal pulled stimela 1.7.9... Everything works fine with stimela 1.7.9 -- both with the system Python and with the module-loaded Python.
OK @pharaofranz, many thanks for the detailed reporting. Maybe one of the Stimela folks could comment on this before we close the issue?
Hi @pharaofranz, thank you for the issue.
I can confirm that the latest `stimela-classic` version 1.7.9 added support for apptainer/singularity.
I will also attempt to reproduce this error to see why it occurred.
@francescaLoi may have reported a related issue due to a different environment setup.
I'm preparing a pre-release of `caracal` with the latest updates, and any apptainer-related issues can be tested against this.
Please re-open if experiencing the issue.