Rule executes successfully but snakemake still throws an error
FridolinHaag opened this issue · 1 comment
The following strange behavior occurs for me: jobs are executed successfully, but snakemake still throws an error afterwards. With the --keep-incomplete
option I can keep the created files, but that does not seem like a sound solution. The logs give me no indication of what is wrong. I am not sure whether this is an issue/bug that can actually be addressed in the htcondor profile, but perhaps you have an idea.
Snakemake version: snakemake-minimal 6.6.1; python-htcondor 9.1.1
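For context, my understanding is that snakemake decides success or failure from whatever the profile's status command prints ("running", "success", or "failed"). The following is only a rough sketch of such a status check using python-htcondor, not the profile's actual code; the ClassAd attributes, status codes, and the way the external jobid maps to a ClusterId are assumptions for illustration.

# Sketch of a cluster status check; NOT the htcondor profile's actual code.
# Attribute names, status codes, and the jobid handling are illustrative assumptions.
import sys
import htcondor

cluster_id = sys.argv[1]  # external jobid handed over by snakemake (placeholder handling)

schedd = htcondor.Schedd()
ads = schedd.query(
    constraint=f"ClusterId == {cluster_id}",
    projection=["JobStatus", "ExitCode"],
)

if not ads:
    # Job already left the queue; a real script would also consult the history or event log.
    print("success")
elif ads[0]["JobStatus"] in (1, 2):  # 1 = idle, 2 = running
    print("running")
elif ads[0]["JobStatus"] == 4 and ads[0].get("ExitCode", 1) == 0:  # 4 = completed
    print("success")
else:
    print("failed")

If such a check misreads the job state (or cannot find the job at the wrong moment), snakemake would report a failure even though the job itself terminated normally.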
Example:
Snakefile:
rule all:
    input:
        expand("myout/file_{index}.txt", index=range(2))

rule create:
    output:
        "myout/file_{index}.txt"
    run:
        print("Creating file " + wildcards.index)
        with open(output[0], "w") as out:
            out.write(wildcards.index)
        print("Success creating file " + wildcards.index)
snakemake log with the error:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 5000
Job stats:
job       count    min threads    max threads
------    -----    -----------    -----------
all           1              1              1
create        2              1              1
total         3              1              1
Select jobs to execute...
(...)
[Tue Aug 3 10:45:10 2021]
rule create:
output: myout/file_0.txt
jobid: 1
wildcards: index=0
resources: tmpdir=/tmp
Submitted job 1 with external jobid '1_5c6be0e5-5c2d-4378-bb3f-fd344e84b552_139'.
(...)
[Tue Aug 3 10:45:18 2021]
Error in rule create:
jobid: 1
output: myout/file_0.txt
cluster_jobid: 1_5c6be0e5-5c2d-4378-bb3f-fd344e84b552_139
Error executing rule create on cluster (jobid: 1, external: 1_5c6be0e5-5c2d-4378-bb3f-fd344e84b552_139, jobscript: /home/mypath/.snakemake/tmp._8ciql23/snakejob.create.1.sh). For error details see the cluster log and the log files of the involved rule(s).
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/mypath/.snakemake/log/2021-08-03T104507.463350.snakemake.log
But condor.err shows the job finished, and the files are actually created:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 128
Rules claiming more threads will be scaled down.
Select jobs to execute...
[Tue Aug 3 10:45:14 2021]
rule create:
output: myout/file_0.txt
jobid: 0
wildcards: index=0
resources: mem_mb=1000, disk_mb=1000, tmpdir=/var/lib/condor/execute/dir_111320
[Tue Aug 3 10:45:14 2021]
Finished job 0.
1 of 1 steps (100%) done
condor.log:
000 (139.000.000) 08/03 10:45:10 Job submitted from host: <10.10.1.2:9618?addrs=10.10.1.2-9618&noUDP&sock=9826_65ad_4>
...
001 (139.000.000) 08/03 10:45:13 Job executing on host: <10.0.81.101:9618?addrs=10.0.81.101-9618&noUDP&sock=2662_7d33_3>
...
006 (139.000.000) 08/03 10:45:14 Image size of job updated: 1250
0 - MemoryUsage of job (MB)
0 - ResidentSetSize of job (KB)
...
005 (139.000.000) 08/03 10:45:14 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 1250 1250 3913769
Gpus : 4
Memory (MB) : 0 2 2
...
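The same information can also be read programmatically from the event log with the python-htcondor bindings; a minimal sketch (the log path is a placeholder):

# Minimal sketch: confirm the JOB_TERMINATED event and its return value
# by parsing the event log with python-htcondor. The path is a placeholder.
import htcondor

for event in htcondor.JobEventLog("condor.log").events(stop_after=0):
    if event.type == htcondor.JobEventType.JOB_TERMINATED:
        # "ReturnValue" is present for normally terminated jobs, as in the excerpt above.
        print(f"job {event.cluster}.{event.proc} terminated, ReturnValue = {event['ReturnValue']}")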
Any ideas?
I started afresh, creating a new profile in a new conda environment, and the problem no longer occurs. It is also possible that this was a file system problem. While it remains a bit mysterious, I think we can assume this was user error on my side, and this issue can be closed.