Snakemake-Profiles/slurm

Provided cores differ from what's passed in the CLI


Hi,

first, thank you for providing this profile. It's extremely useful.

I'm having trouble launching multiple jobs at the same time. For example, I want to launch 5 jobs simultaneously, each with 64 cores.

If I run snakemake --cores 64, the jobs get launched sequentially rather than in parallel. I understand this is because I requested a maximum of 64 cores, so if one job takes all of them up, only one can run at a time.

Now, I wrote a function that is passed to each rule's threads directive and multiplies workflow.cores by, say, 0.2, so that passing snakemake --cores 320 allocates 64 cores to each rule.
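A minimal sketch of the idea (the name get_threads, the example filenames, and the fixed 0.2 factor are simplified stand-ins for my actual helper):

def get_threads(wildcards):
    # workflow.cores holds whatever was passed via --cores on the CLI,
    # so with --cores 320 this returns 64
    return max(1, round(workflow.cores * 0.2))

rule map_reads:
    input: "catalogue.mmi", "reads_R1.fq.gz", "reads_R2.fq.gz"
    output: "out.bam"
    threads: get_threads
    shell: "minimap2 -t {threads} {input} > {output}"

However, I am finding that the 0.2 factor somehow gets applied twice ("squared"). What happens is: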

  • the Snakemake STDOUT (what is shown on the screen) shows the correct number of threads:
rule map_reads:
    input: output/mapping/H/catalogue.mmi, output/qc/merged/H_S003_R1.fq.gz, output/qc/merged/H_S003_R2.fq.gz
    output: output/mapping/bam/H/H_S003.map.bam
    log: output/logs/mapping/map_reads/H-H_S003.log
    jobid: 247
    benchmark: output/benchmarks/mapping/map_reads/H-H_S003.txt
    reason: Missing output files: output/mapping/bam/H/H_S003.map.bam
    wildcards: binning_group=H, sample=H_S003
    threads: 64
    resources: tmpdir=/tmp, mem_mb=149952


    minimap2 -t 64 > output/mapping/bam/H/H_S003.map.bam

Submitted job 247 with external jobid '35894421'.

That looks fine. I want this rule to be launched with 64 cores, and when I run it this way, 5 instances of the rule do get launched at the same time.

When I open the job's SLURM log, however, I find that this value of 64 is passed to the job as its "Provided cores", and is therefore multiplied by 0.2 again.

Contents of slurm-35894421.out:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Select jobs to execute...

rule map_reads:
    input: output/mapping/H/catalogue.mmi, output/qc/merged/H_S003_R1.fq.gz, output/qc/merged/H_S003_R2.fq.gz
    output: output/mapping/bam/H/H_S003.map.bam
    log: output/logs/mapping/map_reads/H-H_S003.log
    jobid: 247
    benchmark: output/benchmarks/mapping/map_reads/H-H_S003.txt
    reason: Missing output files: output/mapping/bam/H/H_S003.map.bam
    wildcards: binning_group=H, sample=H_S003
    threads: 13
    resources: tmpdir=/tmp, mem_mb=149952


    minimap2 -t 13 > output/mapping/bam/H/H_S003.map.bam

Even worse, my job allocates 64 cores but only uses 13 (64 * 0.2, rounded). It's really strange to me that the Snakemake output shows the "correct" value while the SLURM log shows the "real" value that was actually used. Why do they differ?

I am trying to understand what I am doing wrong. When I set a breakpoint in the function that computes the number of threads, the workflow.cores variable is always what I pass on the command line (320), never what is shown in the SLURM log.

I tried adding a nodes: 5 or a jobs: 5 key to the profile's config.yaml, but it doesn't do any good. Is there anything I can modify in the profile to make sure I can launch as many parallel jobs as possible?
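For reference, this is roughly what I added to the profile's config.yaml (a sketch; 5 is just my target number of concurrent jobs):

# profile config.yaml -- what I tried
jobs: 5     # should be equivalent to passing --jobs 5 on the command line
# nodes: 5  # tried this as well; no effect either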

Please let me know what other information I can provide. Thank you very much.

Best,
V

Update:

I found a temporary fix by using Snakemake's --max-threads option. However, I am still trying to understand the best practice for my scenario, and why my function call is "doubled", that is, why the SLURM log differs from the Snakemake output: inside the job, "Provided cores" equals the number of threads passed to the rule, and the number of threads is then calculated again from those provided cores. Thank you!
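If my reading is correct, the arithmetic works out like this (a sketch of the two invocations; I'm assuming the profile's jobscript re-invokes Snakemake with --cores set to the rule's threads, which is what the "Provided cores: 64" line in the SLURM log suggests):

# outer run, on the login node:
#   snakemake --cores 320        -> workflow.cores = 320
#   threads = round(320 * 0.2) = 64, job submitted asking for 64 cores
#
# inner run, inside the SLURM job:
#   snakemake ... --cores 64     -> "Provided cores: 64", workflow.cores = 64
#   threads = round(64 * 0.2) = 13, so minimap2 runs with -t 13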