Snakemake-Profiles/slurm

slurm_utils.py submit_job function is very slow

atermanini opened this issue · 13 comments

Hello,

I am using this cookiecutter template with the latest snakemake and SLURM.

For pipelines that spawn jobs that take only a short time to execute (e.g. 30 seconds or less), even when spawning 1000 independent jobs I cannot get more than 30-40 jobs running at a time. Therefore, I am not able to use the full power of the HPC.

After many checks, I found that snakemake takes at least 1 second to spawn a new job.

The majority of this time is taken by the last line of the slurm-submit.py script, which takes around 0.7 seconds:
print(slurm_utils.submit_job(jobscript, **sbatch_options))

Going into slurm_utils.py and inspecting the submit_job function with some timing checks, I can confirm that this is where the majority of the time is spent.

Is there any way to optimize this function?

Here we are using several pipelines that spawn thousands of independent jobs, each of which takes no more than 30 seconds to complete. If snakemake spawns only one job per second, it is difficult to take advantage of our cluster.

Please let me know if I can help with more details / code to replicate the problem.

Thanks very much,

Best,
Alberto

Hi @atermanini ,
I quickly checked the code in the function and I do not see any possible source of this slowness except the call to SLURM itself.
Can you check how long it takes to submit a job to SLURM with the sbatch command:

/usr/bin/time -v sbatch myscript.sbatch

I am afraid your SLURM system is a bit sluggish.

Hi @maarten-k

thanks very much for your help.

The results are:

real     0m0.060s
user    0m0.004s
sys     0m0.000s

On our system the submission with sbatch is fast.
Also, when running snakemake without the cookiecutter profile, the submission of jobs is fast
(--cluster sbatch --partition cpuq --time 00:10:00 --mem 1G --nodes 1 --ntasks 1 --cpus-per-task 1)

I am experiencing this delay of nearly one second only with the cookiecutter template.

For debugging, I took the jobscript from snakemake's temporary folder (snakejob.align_seq.2.sh), and I also printed the "cmd" variable that is passed to sp.check_output(cmd) inside the submit_job function:

['sbatch', '--parsable', '--nodes=1', '--ntasks=1', '--cpus-per-task=1', '--mem=1000', '--time=00:10:00', '--partition=cpuq', '--job-name=align_seq', '--output=slurm_logs/align_seq/slurm-%j.out', '--error=slurm_logs/align_seq/slurm-%j.err', '/home/alberto.termanini/tests/snakemake_tests/pipeline_05/.snakemake/tmp.gspor3wl/snakejob.align_seq.2.sh']

less snakejob.align_seq.2.sh

#!/bin/bash
# properties = {"type": "single", "rule": "align_seq", "local": false, "input": ["data/0001.fastq"], "output": ["sam/0001.sam"], "wildcards": {"sample": "0001"}, "params": {}, "log": [], "threads": 1, "resources": {"tmpdir": "/tmp", "mem_mb": 1000, "time": "00:10:00"}, "jobid": 2, "cluster": {}}
 cd /home/alberto.termanini/tests/snakemake_tests/pipeline_05 && \
/home/alberto.termanini/conda_envs/snakemake/bin/python3.9 \
-m snakemake sam/0001.sam --snakefile /home/alberto.termanini/tests/snakemake_tests/pipeline_05/Snakefile \
--force --cores all --keep-target-files --keep-remote --max-inventory-time 0 \
--wait-for-files /home/alberto.termanini/tests/snakemake_tests/pipeline_05/.snakemake/tmp.31efp_wf data/0001.fastq --latency-wait 90 \
 --attempt 1 --force-use-threads --scheduler ilp \
--wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/ \
  -p --allowed-rules align_seq --nocolor --notemp --no-hooks --nolock --scheduler-solver-path /home/alberto.termanini/conda_envs/snakemake/bin \
--mode 2  --default-resources "tmpdir=system_tmpdir"  && exit 0 || exit 1

Here is my test pipeline code:

less Snakefile

from os.path import join

DATA_DIR = 'data'
SAMPLES = glob_wildcards(join(DATA_DIR, "{sample}.fastq")).sample

rule all:
    input:
        'results/summary.txt'
        
rule make_summary:
    input:
        expand("sam/{sample}.sam", sample = SAMPLES)
    output:
        'results/summary.txt'
    threads: 1    
    resources:
        mem_mb=500,
        time="00:02:00"
    shell:
        'ls {input} | wc -l > {output}'

rule align_seq:
    input:
        join(DATA_DIR,'{sample}.fastq')
    output:
        'sam/{sample}.sam'
    threads: 1
    resources:
        mem_mb=1000,
        time="00:10:00"
    shell:
        "sleep 10s; "
        "echo {input} > {output}"

The files are just empty files generated with:
touch data/{0001..1000}.fastq

I hope you can help me find out why this is happening.

Thanks again,
Best, Alberto

Looks like SLURM is not the culprit, and to be honest I do not know what is for now.

Going into slurm_utils.py and inspecting the submit_job function with some timing checks, I can confirm that this is where the majority of the time is spent.

Did you profile the submit_job function in a bit more detail?
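
For example, something like this would break down where the time goes inside submit_job (just a sketch: the jobscript path and sbatch options are placeholders based on the command you printed, and it assumes slurm_utils.py from the profile directory is importable):

import cProfile
import pstats

from slurm_utils import submit_job

# Placeholders: substitute a real jobscript and the sbatch options that
# slurm-submit.py builds for one of your jobs.
jobscript = ".snakemake/tmp.gspor3wl/snakejob.align_seq.2.sh"
sbatch_options = {"partition": "cpuq", "time": "00:10:00", "mem": "1000"}

pr = cProfile.Profile()
pr.enable()
submit_job(jobscript, **sbatch_options)
pr.disable()

pstats.Stats(pr).sort_stats("cumulative").print_stats(20)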

Hi @maarten-k ,

by profiling the submit_job function (just printing the difference between datetime.now() before and after each block of code), I saw that the command:

res = sp.check_output(cmd)

is taking the majority of the time: from 0.6 to 1 second or more (see screenshot in attachment).
This is happening both when executing snakemake on a head node of the cluster and when running it inside an sbatch script:

0:00:00.059181 0:00:00.074783 0:00:00.064700 0:00:00.059625 0:00:00.050544 0:00:00.053404 0:00:00.055185 0:00:00.055678 0:00:00.054610 0:00:00.050749 0:00:00.054997 0:00:00.081413 0:00:00.094667 0:00:00.055244 0:00:00.051311 0:00:00.110717 0:00:00.119001 0:00:00.053417

Usually, the first three or four jobs that are spawned have even higher times (2-3 seconds), then the timing goes down and stays between 0.5 and 1 second. I don't know if this is helpful: I am running snakemake inside a conda environment which is installed in my home folder.
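
For reference, the timing check around that call is essentially just this (a minimal sketch; the cmd list is the one printed earlier, abridged, with the jobscript path shortened):

import subprocess as sp
from datetime import datetime

# The command list that submit_job passes to sp.check_output; some flags and
# the full jobscript path are omitted here for readability.
cmd = [
    "sbatch", "--parsable", "--nodes=1", "--ntasks=1", "--cpus-per-task=1",
    "--mem=1000", "--time=00:10:00", "--partition=cpuq",
    ".snakemake/tmp.gspor3wl/snakejob.align_seq.2.sh",
]

before = datetime.now()
res = sp.check_output(cmd)
print("sp.check_output:", datetime.now() - before)  # 0.6-1 s in my case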

Please let me know if I can make some more tests.

Thanks again,
Alberto
[screenshot attachments: timing output]

The slowness is because each submission starts a new Python interpreter to run slurm-submit.py. See #73

Hi @frankier,

I am not sure that the problem is starting a new Python interpreter. That certainly takes time, but slurm-submit.py takes around 0.7 seconds to execute, and 0.6 seconds of that (about 90%) is spent in the "res = sp.check_output(cmd)" call, so I think it is something more specific to this function call.

What do you think?

Best,
Alberto

Looks like I jumped the gun there since it seems like that's not where your slowness is. Did you try the alternative profile anyway? Whether it works faster or not might provide extra information.

Hi @frankier ,

I will try the alternative profile suggested and let you know.

Thanks again,
Alberto

Hi @frankier ,

with the "Simple SLURM" profile and status-sacct.sh, job submission was much faster: I reached a peak of around 470 running jobs. Each job executes a simple "sleep 30s" shell command, and there are 1000 jobs in total. The execution seems to be stable.

Thanks,
Alberto

Hi @atermanini , sorry for the delay. Thank you for your detailed report and the identification of the bottleneck you are experiencing when submitting your jobs. I will try to find some time to look into this. I personally haven't really had any problems with the submissions being a bit slow for the simple reason that I haven't dealt with the number of jobs that you are.

Cheers,

Per

For pipelines that spawn jobs that take only a short time to execute (e.g. 30 seconds or less), even when spawning 1000 independent jobs I cannot get more than 30-40 jobs running at a time. Therefore, I am not able to use the full power of the HPC.

@atermanini This was the exact same problem I was having when I decided to create smk-simple-slurm. I knew from previous experience with Snakemake prior to the introduction of the --profile flag that it had the potential to submit jobs much faster.

The slowness is because each submission starts a new Python interpreter to run slurm-submit.py.

@frankier Thanks for sharing! My working assumption is still that all the extra Python invocations are causing the slowdown, but I admit I didn't perform a detailed diagnosis of the exact bottlenecks. I just assumed that Snakemake (i.e. Python) shelling out to Bash just to re-invoke Python again over and over (for each job submission and each time each job status is checked) was contributing to the problem.
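
A rough way to gauge that fixed per-invocation cost (a sketch; the import list is only a guess at what slurm-submit.py pulls in, and it assumes snakemake is importable in the active environment):

import subprocess as sp
import sys
import time

# Start a fresh interpreter and import the modules a submit/status script
# typically needs; the list below is illustrative, not the exact set used
# by slurm-submit.py.
start = time.perf_counter()
sp.run([sys.executable, "-c", "import argparse, json, snakemake.utils"], check=True)
print(f"interpreter start-up + imports: {time.perf_counter() - start:.2f}s")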

In case it could be useful for comparing submission speeds, I created the benchmark below to convince myself that Snakemake could submit jobs faster (i.e. that Slurm itself wasn't the culprit).

https://github.com/jdblischak/smk-simple-slurm/tree/7c54cba6ff59a186d8f33d4a14a535607b980868/examples/jobs-per-second

Just chiming in - I am new to this profile (I am the developer of the LSF profile but have just moved to an institute that uses slurm).

If you have advanced argument conversion turned on, I believe there is an additional call to sinfo being made to check the limits for the partition/cluster:

cmd = f"sinfo -e -o %all -p {partition} {cluster}".split()
try:
    output = sp.Popen(" ".join(cmd), shell=True, stdout=sp.PIPE).communicate()
except Exception as e:
    print(e)
    raise

I have not profiled with and without this advanced argument parsing, but it must surely add to the overhead?

I guess a first question would be: is this needed? I kind of feel that if a user is asking for more time than the partition allows, that's their issue, not this profile's?

@mbhall88 yes, the advanced argument hack with its multiple calls to sinfo is certainly not ideal. Storing the cluster configuration in a file and loading it once would be one way to improve this step. Regarding whether it is needed or not, the initial motivation was to handle cases where time/memory is automatically increased (e.g. doubled in a lambda function) on failed jobs. Admittedly, this need has proven less evident with the new flags for setting resources. I'll mark it for reworking/removal.
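
For illustration, the caching could look roughly like this (a sketch of the idea, not the profile's actual code; since each submission runs in a fresh interpreter, the cache has to live on disk rather than in memory, and the cache path below is made up):

import json
import subprocess as sp
from pathlib import Path

# Hypothetical per-user cache location.
CACHE = Path.home() / ".cache" / "snakemake-slurm-sinfo.json"

def partition_info(partition, cluster=""):
    """Return the raw sinfo output for a partition, calling sinfo only the
    first time and re-reading the cache file on later submissions."""
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    key = f"{partition}:{cluster}"
    if key not in cache:
        cmd = f"sinfo -e -o %all -p {partition} {cluster}".split()
        cache[key] = sp.check_output(cmd).decode()
        CACHE.parent.mkdir(parents=True, exist_ok=True)
        CACHE.write_text(json.dumps(cache))
    return cache[key]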

Still, the issue with slow submission for many short jobs remains.