Difference between the SLURM command Snakemake says it runs and what is actually run
Dear developers
I have a rule that produces unexpected results when submitted to SLURM (Snakemake version 8.10.0). So I turned on verbose output to double-check that the command was formatted correctly. According to Snakemake, the following command was submitted to SLURM:
gatk SplitIntervals -R some_file -L some_file --scatter-count 32 --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION -O some_folder
Job 11 has been submitted with SLURM jobid 633204 (log: /.../633204.log).
Then I opened the referenced 633204.log, and there the command looked like this:
gatk SplitIntervals -R some_file -L some_file --scatter-count 1 --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION -O some_folder
Notice how the --scatter-count option has changed from 32 to 1, which explains the unexpected results. This number is based on available cores and is calculated like this in my workflow:
large_job_cores = max(1, math.floor(workflow.cores / len(SAMPLES)))
Since this number has apparently changed between the main workflow and the SLURM submission, I wonder whether Snakemake is run a second time for each SLURM submission. If so, workflow.cores (and maybe other parameters) should somehow be transferred from the main workflow to the SLURM submission.
Thank you for all your hard work!
Cheers!
Hi,
Thanks for the feedback!
The SLURM executor plugin is oblivious to whatever you write in your Snakefile, except for threads and cpus_per_task (the latter usually defined as a resource in a workflow profile).
So, how is large_job_cores propagated further? Please show us that code so we can determine whether we are indeed dealing with a bug. Thank you!
Thanks, certainly.
large_job_cores is used to control multithreading in rules that launch multi-threaded programs. In addition, parallel computing is also achieved for certain single-threaded jobs by splitting input files into chunks before running them. The latter is what SplitIntervals is for: it creates as many chunks as you instruct it to via the --scatter-count option.
I don't know if it will help to shed some light on this, but here is how large_job_cores is used in that rule:
large_job_cores = max(1, math.floor(workflow.cores / len(SAMPLES)))
INTERVAL_SUBDIVISION_NUMBERS = [str(i).zfill(4) for i in range(large_job_cores)]

rule subdivide_intervals_file:
    input:
        intervals_file = rules.create_intervals_file.output.intervals_file,
        ref = rules.prepare_genome_files.output.fasta_ref,
        ref_index = rules.prepare_genome_files.output.ref_index,
        ref_dict = rules.create_reference_dict.output.ref_dict
    output:
        scattered_interval_lists = expand(os.path.join(intervals_folder, '{subdivision}scattered.interval_list'), subdivision=INTERVAL_SUBDIVISION_NUMBERS)
    conda:
        conda_environment
    params:
        intervals_folder = intervals_folder
    log:
        os.path.join(logs, 'subdivide_intervals_file.log')
    shell:
        """
        gatk SplitIntervals \
            -R {input.ref} \
            -L {input.intervals_file} \
            --scatter-count {large_job_cores} \
            --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
            -O {params.intervals_folder} \
            > {log} 2>&1
        """
Since large_job_cores has a direct consequence for the expected output files of this rule, Snakemake throws a missing-files error. That is, on a workflow run where large_job_cores evaluates to 32, Snakemake says it submits the correct job command, but, as the SLURM log shows, this number has been changed to 1. So Snakemake expects 32 files, but only one gets produced. For the rules that run multithreaded programs, I expect that they will also run with just one core instead of the expected 32; however, no missing-files error arises in those cases.
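To make the arithmetic concrete, here is an illustration only (the sample count of 8 is just an assumption that happens to fit my numbers):

import math

n_samples = 8                        # assumed for illustration only

# At DAG-build time on the login node, workflow.cores is the requested value,
# e.g. 256, so the rule expects 32 scattered interval lists:
max(1, math.floor(256 / n_samples))  # -> 32

# Inside the SLURM job, workflow.cores apparently re-evaluates to 1, so the
# command runs with --scatter-count 1 and only a single file appears:
max(1, math.floor(1 / n_samples))    # -> 1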
In any case, I expect that a difference between what Snakemake says it sends to SLURM and what is actually sent is not something one would want. Thank you for looking into this.
Cheers.
OK, thank you. This is no bug in the executor, but rather one in your Snakefile. Let me explain:
You instruct your executable to use a certain amount of resources, yet upon submission nothing is reserved. I suggest changing to either
rule subdivide_intervals_file:
    <snip>
    resources:
        cpus_per_task = large_job_cores
    <snip>
or
rule subdivide_intervals_file:
    <snip>
    threads: large_job_cores
    <snip>
However, as large_job_cores presumably gets calculated on a login node, you should consider that your assumption does not hold when the number of cores there differs from that on the compute nodes. To keep your workflow portable, you had better perform the calculation within the rule, e.g.:
rule subdivide_intervals_file:
    <snip>
    threads: max(1, math.floor(workflow.cores / len(SAMPLES)))
    shell: "gatk ... --scatter-count {threads} ..."
I just hope I did not make a mistake, as my advice is obviously untested.
Cheers
Christian
Thank you, Christian.
I am not sure I agree, though. I believe the error originates in (or is exemplified by) the workflow.cores property, which appears to be evaluated twice when the workflow is run with the SLURM executor:
In the context of my data, the first evaluation seems to happen when Snakemake builds the DAG and figures out the names of input and output files. Here it evaluates to the expected 256, as evident from the verbose output and the list of expected output files from subdivide_intervals_file. This rule, by the way, always runs single-threaded, but the filenames it is supposed to output depend on the evaluation of workflow.cores, which is reflected in large_job_cores.
The second evaluation seems to happen when the job is submitted to SLURM, where it appears to take on a lower value. It is possible that it gets calculated on a login node, as you suggest, but if that is indeed the case, isn't that a problem?
If this is the expected behavior, then it means that workflow.cores cannot be used when SLURM is the executor. I hope this is not the intended behavior, but if it is, it should be clearly indicated in the documentation.
Cheers,
Steffen
Hi,
I am not sure I agree, though.
Try the following:
#!/bin/bash
#SBATCH ... # all the settings for your job submissions
#SBATCH -c1
echo "I want to run with n-threads, but am confined to ${SLURM_CPUS_PER_TASK} threads."
Whatever is asked for at submit time is available in the job context - if physically possible and not otherwise restricted by settings. When you calculate your maximum number of threads on the login node and do not use it for the submission (by means of Snakemake's resources), you will not get that resource in the job context. If you recalculate in the job context, you will get different numbers depending on the tools you use, e.g. the threads available in the cgroup, the number of CPUs on the node, etc. This is not reliable. The workflow.cores variable will yield different values, too: Snakemake essentially submits itself, hence the re-evaluation.
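If you prefer to check this from within Snakemake rather than with a raw batch script, a rough, untested sketch (the rule name and output file are made up; SLURM_CPUS_PER_TASK is only set when CPUs per task are requested at submission):

rule check_reservation:
    output:
        "reservation_check.txt"
    threads: 4
    shell:
        """
        # inside the SLURM job, SLURM_CPUS_PER_TASK should match the value
        # reserved via this rule's threads directive
        echo "threads={threads} cpus_per_task=${{SLURM_CPUS_PER_TASK}}" > {output}
        """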
Hmm, thanks, but I fail to understand how this addresses my disagreement. So let me try to be more specific about why I disagree, and hopefully you will take another fresh look at this. Thank you for taking the time.
I don't agree that there is a bug in my snakefile. Why? Because it works and runs as expected if I do not reference workflow.cores. Conclusion: workflow.cores is the culprit here and we can sweep everything else off the table and focus on that.
And just to be clear: workflow.cores is mentioned in several places in the main Snakemake documentation as a resource that is available to the user.
Another important remark is this: workflow.cores does NOT represent an evaluation of the available number of cores in the compute environment. Instead, it reflects the number requested when snakemake is called. That is, if I call snakemake like this on a single-core machine:
snakemake -s snakefile.smk -c1000
the value of workflow.cores is not 1 but 1000.
So, speculations on the influence of the login node can also be set aside.
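To demonstrate, an untested sketch (the rule and file name are made up) that simply writes the value out:

rule show_cores:
    output:
        "cores.txt"
    run:
        # with `snakemake -s snakefile.smk -c1000` this writes 1000,
        # even on a single-core machine
        with open(output[0], "w") as fh:
            fh.write(f"{workflow.cores}\n")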
The question now is this: why does workflow.cores get redefined when SLURM is used as the executor? It shouldn't get redefined.
I don't have any experience with the Snakemake codebase and I am not familiar with dataclasses, but a quick comparison between the code for snakemake-executor-plugin-slurm and the skeleton for executor plugins reveals that the workflow seems to be referenced in the skeleton's __post_init__ method of the Executor, but not in snakemake-executor-plugin-slurm.
Thus, I may be completely wrong, but based on this my best guess is that workflow.cores is undefined and a default value of 1 is therefore used.
Cheers.
Sorry, I should have been more elaborate:
The idea is to trigger a workflow run on the login node. This is the first invocation of Snakemake, and workflow.cores is evaluated to a number. Some jobs, those not defined as localrules, get submitted. This is where the jobstep executor plugin kicks in. It is the second invocation of Snakemake per (group) job, and workflow.cores is re-evaluated.
Now, cluster admins might treat this differently, but the default is that SLURM sets its cgroup according to the values present at submit time (we will not consider the case where a job is rejected due to faulty parameters). Basically, there are two scenarios:
- The cgroup is most likely smaller than your "allowance" on the login node (for a shared-memory application, cpus_per_task is big, which is equivalent to threads - any setting is fine). Hence, the apparent(!) core count based on workflow.cores will be lowered.
- If the calculation is not based on the cgroup, the compute node might still differ in core number from the login node (let's disregard the small but important difference between cores and CPUs for this discussion).
In both cases, we suffer from a deviation in the apparent core count. As your count for workflow.cores is 1 in the job context, I presumed you suffer from scenario one.
This is why, upon workflow design, we need to ensure that whatever we want to use in terms of resources is actually reserved. And it is the reason, too, why we ought to store any resource calculation as a parameter of the rule.
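As a rough, untested sketch of what that advice could look like for the rule discussed above (variable names follow the earlier snippets):

scatter_count = max(1, math.floor(workflow.cores / len(SAMPLES)))

rule subdivide_intervals_file:
    <snip>
    threads: scatter_count               # reserve what the command is meant to use
    params:
        scatter_count = scatter_count    # keep the calculated value with the rule
    shell:
        "gatk SplitIntervals ... --scatter-count {params.scatter_count} ..."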
But yes, this limits the use case of workflow.cores somewhat. I will put it in the documentation of this plugin!