Snakemake-Profiles/htcondor

Rule executes successfully, but Snakemake still throws an error

FridolinHaag opened this issue · 1 comment

The following strange behavior occurs for me: jobs are executed successfully, but Snakemake then still throws an error. With the --keep-incomplete option I can keep the created files, but that does not seem like a sound solution. The logs give me no indication of what is wrong. I am not sure this is an issue/bug that can actually be addressed in the htcondor profile, but perhaps you have an idea.

Versions: snakemake-minimal 6.6.1, python-htcondor 9.1.1

Example:

Snakefile:

rule all:
    input: 
        expand("myout/file_{index}.txt", index = range(2))
        
rule create:
    output:
        "myout/file_{index}.txt"
    run:
        print("Creating file " + wildcards.index)
        with open(output[0], "w") as out:
            out.write(wildcards.index)
        print("Success creating file " + wildcards.index)

The Snakemake log with the error:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 5000
Job stats:
job       count    min threads    max threads
------  -------  -------------  -------------
all           1              1              1
create        2              1              1
total         3              1              1

Select jobs to execute...

(...)

[Tue Aug  3 10:45:10 2021]
rule create:
    output: myout/file_0.txt
    jobid: 1
    wildcards: index=0
    resources: tmpdir=/tmp

Submitted job 1 with external jobid '1_5c6be0e5-5c2d-4378-bb3f-fd344e84b552_139'.
(...)
[Tue Aug  3 10:45:18 2021]
Error in rule create:
    jobid: 1
    output: myout/file_0.txt
    cluster_jobid: 1_5c6be0e5-5c2d-4378-bb3f-fd344e84b552_139

Error executing rule create on cluster (jobid: 1, external: 1_5c6be0e5-5c2d-4378-bb3f-fd344e84b552_139, jobscript: /home/mypath/.snakemake/tmp._8ciql23/snakejob.create.1.sh). For error details see the cluster log and the log files of the involved rule(s).
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/mypath/.snakemake/log/2021-08-03T104507.463350.snakemake.log
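
As far as I understand, Snakemake never sees the HTCondor exit status directly: it periodically runs the profile's cluster-status command with the external jobid and expects it to print one of "running", "success", or "failed". If that command misreports or crashes, the job is marked as failed even though the condor job terminated normally. Just to illustrate the protocol, a minimal status script could look roughly like the sketch below; this is not the profile's actual script, and the jobid parsing is an assumption based on the external jobid format shown above:

#!/usr/bin/env python3
# Hypothetical minimal cluster-status script, for illustration only.
# Snakemake invokes it as "<script> <external jobid>" and reads one of
# "running", "success", or "failed" from stdout.
import sys

import htcondor

# Assumption: the HTCondor cluster id is the last underscore-separated
# component of the external jobid (e.g. "1_<uuid>_139" -> 139).
cluster_id = sys.argv[1].rsplit("_", 1)[-1]

schedd = htcondor.Schedd()
ads = schedd.query(
    constraint=f"ClusterId == {cluster_id}",
    projection=["JobStatus"],
)
if ads:
    # Still in the queue: JobStatus 5 means held, anything else we
    # report as running.
    print("failed" if ads[0]["JobStatus"] == 5 else "running")
else:
    # Job has left the queue; look up its exit code in the history.
    hist = list(schedd.history(f"ClusterId == {cluster_id}", ["ExitCode"], 1))
    if hist and hist[0].get("ExitCode") == 0:
        print("success")
    else:
        print("failed")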

But condor.err shows that the job finished, and the files were actually created:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 128
Rules claiming more threads will be scaled down.
Select jobs to execute...

[Tue Aug  3 10:45:14 2021]
rule create:
    output: myout/file_0.txt
    jobid: 0
    wildcards: index=0
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/var/lib/condor/execute/dir_111320

[Tue Aug  3 10:45:14 2021]
Finished job 0.
1 of 1 steps (100%) done

condor.log:

000 (139.000.000) 08/03 10:45:10 Job submitted from host: <10.10.1.2:9618?addrs=10.10.1.2-9618&noUDP&sock=9826_65ad_4>
...
001 (139.000.000) 08/03 10:45:13 Job executing on host: <10.0.81.101:9618?addrs=10.0.81.101-9618&noUDP&sock=2662_7d33_3>
...
006 (139.000.000) 08/03 10:45:14 Image size of job updated: 1250
	0  -  MemoryUsage of job (MB)
	0  -  ResidentSetSize of job (KB)
...
005 (139.000.000) 08/03 10:45:14 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
	Partitionable Resources :    Usage  Request Allocated
	   Cpus                 :                 1         1
	   Disk (KB)            :     1250     1250   3913769
	   Gpus                 :                           4
	   Memory (MB)          :        0        2         2
...
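
To double-check the outcome programmatically, the same event log can be read with the python-htcondor bindings (a small sketch; "condor.log" is the file shown above):

import htcondor

# Iterate over the events already in the log without blocking
# (stop_after=0 returns as soon as the log is exhausted).
jel = htcondor.JobEventLog("condor.log")
for event in jel.events(stop_after=0):
    if event.type == htcondor.JobEventType.JOB_TERMINATED:
        # ReturnValue is the job's exit code; 0 means normal termination.
        print(f"job {event.cluster}.{event.proc} exited with {event['ReturnValue']}")

For the run above this reports an exit code of 0, matching the "Normal termination (return value 0)" line.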

Any ideas?

I started afresh, creating a new profile in a new conda environment, and the problem no longer occurs. Possibly it was a file system problem as well. While it remains a bit mysterious, I think we can assume this was user error on my part, and this issue can be closed.