snakemake/snakemake-executor-plugin-slurm

Setting memory twice when submitting with slurm executor


Versions

snakemake version 8.10.7
snakemake-executor-plugin-slurm version 0.4.4
snakemake-executor-plugin-slurm-jobstep version 0.2.1

The problem

I am working on getting Snakemake version 8 to work on my SLURM server and keep getting the following error:

srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

Looking at the rule description, I can see that two memory resources are being passed:

[Fri Apr 19 13:42:50 2024]
rule fastqc:
    input: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R1.fastq.gz, /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R2.fastq.gz
    output: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/fastqc_control_1_summary_untrimmed.txt
    jobid: 0
    reason: Forced execution
    wildcards: results=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results, sample=control_1
    resources: mem_mb=15259, mem_mib=7630, disk_mb=43311, disk_mib=41305, tmpdir=<TBD>, slurm_partition=acompile, slurm_account=amc-general, slurm_extra=--output=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/logs/fastqc_pre_trim/fastqc_control_1_untrimmed.out --error=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/logs/fastqc_pre_trim/fastqc_control_1_untrimmed.err --qos=compile, runtime=60, mem=16GB

However, I don't know where the mem_mb is being passed.

Profile

executor: slurm

default-resources:
    slurm_partition: "acompile"
    slurm_account:   "amc-general"

set-resources:
    fastqc:
        runtime: 60 # 1 hour
        mem: "16GB"
    fastqc_summary:
        runtime: 10
        mem: "4GB"

My rule

rule fastqc:
    input:
        input_list = _get_input
    output:
        file = "{results}/fastqc_pre_trim/fastqc_{sample}_summary_untrimmed.txt"
    params:
        output_dir  = os.path.join(RESULTS2, "fastqc_pre_trim"),
        directories = _get_directories
    resources:
        slurm_extra=lambda wildcards: (
            f"--output={wildcards.results}/logs/fastqc_pre_trim/fastqc_{wildcards.sample}_untrimmed.out "
            f"--error={wildcards.results}/logs/fastqc_pre_trim/fastqc_{wildcards.sample}_untrimmed.err "
            f"--qos=compile"
        )
    singularity:
       GENERAL_CONTAINER
    shell:
        """
        mkdir -p {params.output_dir}
        fastqc {input} --outdir {params.output_dir}
        for dir in {params.directories};
        do
            name=$(basename -s .zip $dir)

            unzip -p $dir $name/summary.txt \
                >> {output}
        done
        """    

My command

snakemake \
    --snakefile Snakefile \
    --configfile config.yaml \
    --jobs 12 \
    --latency-wait 60 \
    --rerun-incomplete \
    --use-singularity \
    --workflow-profile profiles/default

Attempted fix 1: use mem_mb

I have also tried this using the mem_mb resource instead:

executor: slurm

default-resources:
    slurm_partition: "acompile"
    slurm_account:   "amc-general"

set-resources:
    fastqc:
        runtime: 60 # 1 hour
        mem_mb: 1600
    fastqc_summary:
        runtime: 10
        mem_mb: 4000

I get the same error, but the double memory request is less obvious:

[Fri Apr 19 13:40:42 2024]
rule fastqc:
    input: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R1.fastq.gz, /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R2.fastq.gz
    output: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/fastqc_control_1_summary_untrimmed.txt
    jobid: 0
    reason: Forced execution
    wildcards: results=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results, sample=control_1
    resources: mem_mb=1600, mem_mib=1526, disk_mb=43311, disk_mib=41305, tmpdir=<TBD>, slurm_partition=acompile, slurm_account=amc-general, slurm_extra=--output=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/logs/fastqc_pre_trim/fastqc_control_1_untrimmed.out --error=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/logs/fastqc_pre_trim/fastqc_control_1_untrimmed.err --qos=compile, runtime=60

srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

Attempted fix 2: remove the profile and specify resources within the rule

I have also tried this where I deleted my profile and just assigned the resources within the rule:

rule fastqc:
    input:
        input_list = _get_input
    output:
        file = "{results}/fastqc_pre_trim/fastqc_{sample}_summary_untrimmed.txt"
    resources:
        job_name="fastqc",
        mem_mb=1600,
        runtime=60,
        slurm_extra=lambda wildcards: (
            f"--output={wildcards.results}/logs/fastqc_pre_trim/fastqc_{wildcards.sample}_untrimmed.out "
            f"--error={wildcards.results}/logs/fastqc_pre_trim/fastqc_{wildcards.sample}_untrimmed.err "
            f"--qos=compile"
        )
    params:
        output_dir  = os.path.join(RESULTS2, "fastqc_pre_trim"),
        directories = _get_directories
    singularity:
       GENERAL_CONTAINER
    shell:
        """
        mkdir -p {params.output_dir}
        fastqc {input} --outdir {params.output_dir}
        for dir in {params.directories};
        do
            name=$(basename -s .zip $dir)

            unzip -p $dir $name/summary.txt \
                >> {output}
        done
        """   

Submit with:

snakemake \
    --snakefile Snakefile \
    --configfile config.yaml \
    --jobs 12 \
    --latency-wait 60 \
    --rerun-incomplete \
    --use-singularity \
    --executor slurm \
    --default-resources slurm_account=amc-general slurm_partition=acompile

But this also fails with the same srun error:

srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

Attempted fix 3: submit with sbatch

I've also submitted the job as described in this issue, but that gave the same error as above.

Conclusion

mem_mb is obviously specified somewhere, but I am not sure where to look beyond the profile, rules, and snakemake command. Do you have any idea what I may be missing? Thanks so much for your help!

However, I don't know where the mem_mb is being passed.

Your mem requirement should be translated to mem_mb, which is sufficient. Snakemake merely lists both resources, but that should be fine, as it is only translated into sbatch --mem .... And indeed, within SLURM, --mem and --mem-per-cpu are mutually exclusive. I will try to track this down. For this, it would be extremely helpful if you ran Snakemake with --verbose and attached the output as a file. Also, please state your SLURM version (the output of sinfo --version). Thank you.
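To make the two memory numbers in the rule description easier to follow, here is a small arithmetic sketch in Python. It is not taken from the Snakemake source and rests on stated assumptions: that "16GB" means 16 * 10^9 bytes, that mebibyte values are rounded up, and that mem_mib is derived from Snakemake's default mem_mb expression, min(max(2*input.size_mb, 1000), 8000), which is what the base64-encoded --default-resources argument in the verbose log below decodes to. The input size is inferred from the logged disk_mb=43311, assuming disk_mb = max(2*input.size_mb, 1000). Under these assumptions, the logged mem_mb=15259 is consistent with the profile's 16GB expressed in MiB, while mem_mib=7630 is consistent with the capped 8000 MB default.

```python
import math

BYTES_PER_MB = 1000 ** 2   # decimal megabyte
BYTES_PER_MIB = 1024 ** 2  # binary mebibyte


def mb_to_mib(mb: float) -> int:
    """Convert decimal megabytes to mebibytes, rounding up (assumed rounding)."""
    return math.ceil(mb * BYTES_PER_MB / BYTES_PER_MIB)


def default_mem_mb(input_size_mb: float) -> float:
    """Default mem_mb expression decoded from the logged --default-resources."""
    return min(max(2 * input_size_mb, 1000), 8000)


# Assumption: disk_mb = max(2*input.size_mb, 1000) = 43311, so input is ~21655.5 MB
input_size_mb = 43311 / 2

# Profile value mem: "16GB", read as 16 * 10^9 bytes and expressed in MiB
profile_mem_in_mib = math.ceil(16 * 1000 ** 3 / BYTES_PER_MIB)

# Default mem_mb for this input (capped at 8000 MB), converted to MiB
default_mem_in_mib = mb_to_mib(default_mem_mb(input_size_mb))

print(profile_mem_in_mib)   # compare with the logged mem_mb=15259
print(default_mem_in_mib)   # compare with the logged mem_mib=7630
print(mb_to_mib(1600))      # compare with mem_mib=1526 from attempted fix 1
```

If these assumptions hold, the two values originate from different sources (the profile and the built-in default), but only one of them ends up in the sbatch --mem flag.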

PS: Would you be interested in contributing your workflows to the snakemake-workflows catalogue (see https://snakemake.github.io/snakemake-workflow-catalog/)? Some of yours look pretty interesting!

Thanks for helping with this!


  • The slurm version is 23.02.2

  • Here's the output using --verbose from the master:

snakemake --snakefile Snakefile --configfile config.yaml --jobs 12 --latency-wait 60 --rerun-incomplete --use-singularity --workflow-profile profiles/default --verbose
Using workflow specific profile profiles/default for setting default command line arguments.
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
SLURM run ID: c846e871-127b-46a2-a3c5-559bfafd7f06
Using shell: /bin/bash
Provided remote nodes: 12
Job stats:
job               count
--------------  -------
all                   1
fastqc                1
fastqc_summary        1
total                 3

Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 12}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 11}
Execute 1 jobs...

[Mon Apr 22 08:49:54 2024]
rule fastqc:
    input: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R1.fastq.gz, /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R2.fastq.gz
    output: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/fastqc_control_1_summary_untrimmed.txt
    jobid: 2
    reason: Missing output files: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/fastqc_control_1_summary_untrimmed.txt
    wildcards: results=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results, sample=control_1
    resources: mem_mb=15259, mem_mib=7630, disk_mb=43311, disk_mib=41305, tmpdir=<TBD>, slurm_partition=acompile, slurm_account=amc-general, slurm_extra=--output=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/logs/fastqc_pre_trim/fastqc_control_1_untrimmed.out --error=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/logs/fastqc_pre_trim/fastqc_control_1_untrimmed.err --qos=compile, runtime=60, mem=16GB

General args: ['--force', '--target-files-omit-workdir-adjustment', '--keep-storage-local-copies', '--max-inventory-time 0', '--nocolor', '--notemp', '--no-hooks', '--nolock', '--ignore-incomplete', '', '--verbose ', '--rerun-triggers code software-env mtime params input', '', '', '--deployment-method apptainer', '--conda-frontend mamba', '', '', '--apptainer-prefix /scratch/alpine/kwellswrasman@xsede.org/apptainer_cache', '', '', '--shared-fs-usage source-cache sources storage-local-copies persistence software-deployment input-output', '', '--wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/', '', '', '--configfiles /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/config.yaml', '', '', '--latency-wait 60', '--scheduler ilp', '--local-storage-prefix .snakemake/storage', '--scheduler-solver-path /projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/bin', '', '', '--set-resources base64//ZmFzdHFjOnJ1bnRpbWU9NjA= base64//ZmFzdHFjOm1lbT0xNkdC base64//ZmFzdHFjX3N1bW1hcnk6cnVudGltZT0xMA== base64//ZmFzdHFjX3N1bW1hcnk6bWVtPTRHQg==', '', '', '--default-resources base64//bWVtX21iPW1pbihtYXgoMippbnB1dC5zaXplX21iLCAxMDAwKSwgODAwMCk= base64//ZGlza19tYj1tYXgoMippbnB1dC5zaXplX21iLCAxMDAwKQ== base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= base64//c2x1cm1fcGFydGl0aW9uPWFjb21waWxl base64//c2x1cm1fYWNjb3VudD1hbWMtZ2VuZXJhbA==', '']
sbatch call: sbatch --job-name c846e871-127b-46a2-a3c5-559bfafd7f06 --output /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/.snakemake/slurm_logs/rule_fastqc/_/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results_control_1/%j.log --export=ALL --comment fastqc -A amc-general -p acompile -t 60 --mem 15259 --ntasks=1 --cpus-per-task=1 --output=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/logs/fastqc_pre_trim/fastqc_control_1_untrimmed.out --error=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/logs/fastqc_pre_trim/fastqc_control_1_untrimmed.err --qos=compile -D /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm --wrap="/projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/bin/python3.12 -m snakemake --snakefile /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/Snakefile --target-jobs 'fastqc:results=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results,sample=control_1' --allowed-rules 'fastqc' --cores all --attempt 1 --force-use-threads  --resources 'mem_mb=15259' 'mem_mib=7630' 'disk_mb=43311' 'disk_mib=41305' --wait-for-files '/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/.snakemake/tmp.ch3f265w' '/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R1.fastq.gz' '/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R2.fastq.gz' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers code software-env mtime params input --deployment-method apptainer --conda-frontend mamba --apptainer-prefix /scratch/alpine/kwellswrasman@xsede.org/apptainer_cache --shared-fs-usage source-cache sources storage-local-copies persistence software-deployment input-output --wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/ --configfiles /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/config.yaml --latency-wait 60 --scheduler ilp --local-storage-prefix .snakemake/storage --scheduler-solver-path /projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/bin --set-resources base64//ZmFzdHFjOnJ1bnRpbWU9NjA= base64//ZmFzdHFjOm1lbT0xNkdC base64//ZmFzdHFjX3N1bW1hcnk6cnVudGltZT0xMA== base64//ZmFzdHFjX3N1bW1hcnk6bWVtPTRHQg== --default-resources base64//bWVtX21iPW1pbihtYXgoMippbnB1dC5zaXplX21iLCAxMDAwKSwgODAwMCk= base64//ZGlza19tYj1tYXgoMippbnB1dC5zaXplX21iLCAxMDAwKQ== base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= base64//c2x1cm1fcGFydGl0aW9uPWFjb21waWxl base64//c2x1cm1fYWNjb3VudD1hbWMtZ2VuZXJhbA== --executor slurm-jobstep --jobs 1 --mode remote"
Job 2 has been submitted with SLURM jobid 5783166 (log: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/.snakemake/slurm_logs/rule_fastqc/_/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results_control_1/5783166.log).
The job status was queried with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-04-20T08:00 --endtime now --name c846e871-127b-46a2-a3c5-559bfafd7f06
It took: 0.058480262756347656 seconds
The output is:
'5783166|FAILED
'

status_of_jobs after sacct is: {'5783166': 'FAILED'}
active_jobs_ids_with_current_sacct_status are: {'5783166'}
active_jobs_seen_by_sacct are: {'5783166'}
missing_sacct_status are: set()
[Mon Apr 22 08:50:34 2024]
Error in rule fastqc:
    message: SLURM-job '5783166' failed, SLURM status is: 'FAILED'For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 2
    input: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R1.fastq.gz, /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R2.fastq.gz
    output: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/fastqc_control_1_summary_untrimmed.txt
    log: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/.snakemake/slurm_logs/rule_fastqc/_/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results_control_1/5783166.log (check log file(s) for error details)
    shell:
        
        mkdir -p /scratch/alpine/kwellswrasman@xsede.org/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim
        fastqc /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R1.fastq.gz /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R2.fastq.gz --outdir /scratch/alpine/kwellswrasman@xsede.org/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim
        for dir in /scratch/alpine/kwellswrasman@xsede.org/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/control_1_R1_fastqc.zip /scratch/alpine/kwellswrasman@xsede.org/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/control_1_R2_fastqc.zip;
        do
            name=$(basename -s .zip $dir)

            unzip -p $dir $name/summary.txt                 >> /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/fastqc_control_1_summary_untrimmed.txt
        done
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 5783166

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-04-22T084953.851472.snakemake.log
unlocking
removing lock
removing lock
removed all locks
Full Traceback (most recent call last):
  File "/projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/lib/python3.12/site-packages/snakemake/cli.py", line 2068, in args_to_api
    dag_api.execute_workflow(
  File "/projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/lib/python3.12/site-packages/snakemake/api.py", line 589, in execute_workflow
    workflow.execute(
  File "/projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/lib/python3.12/site-packages/snakemake/workflow.py", line 1285, in execute
    raise WorkflowError("At least one job did not complete successfully.")
snakemake_interface_common.exceptions.WorkflowError: At least one job did not complete successfully.

WorkflowError:
At least one job did not complete successfully.
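The --set-resources and --default-resources values in the log above are base64-encoded, which hides where mem_mb comes from. Here is a minimal Python sketch that decodes them with the standard library, assuming only that each argument is the literal prefix base64// followed by standard base64 (which is how the strings in the log look). The helper name decode_resource_args is hypothetical. Decoded, the defaults for this run include mem_mb=min(max(2*input.size_mb, 1000), 8000) alongside the profile's fastqc:mem=16GB.

```python
import base64


def decode_resource_args(args: list[str]) -> list[str]:
    """Decode 'base64//<payload>' resource arguments to plain text."""
    prefix = "base64//"
    decoded = []
    for arg in args:
        if arg.startswith(prefix):
            payload = arg[len(prefix):]
            decoded.append(base64.b64decode(payload).decode("utf-8"))
        else:
            decoded.append(arg)  # pass through anything without the prefix
    return decoded


# Copied verbatim from the --set-resources argument in the log
set_resources = [
    "base64//ZmFzdHFjOnJ1bnRpbWU9NjA=",
    "base64//ZmFzdHFjOm1lbT0xNkdC",
    "base64//ZmFzdHFjX3N1bW1hcnk6cnVudGltZT0xMA==",
    "base64//ZmFzdHFjX3N1bW1hcnk6bWVtPTRHQg==",
]
# Copied verbatim from the --default-resources argument in the log
default_resources = [
    "base64//bWVtX21iPW1pbihtYXgoMippbnB1dC5zaXplX21iLCAxMDAwKSwgODAwMCk=",
    "base64//ZGlza19tYj1tYXgoMippbnB1dC5zaXplX21iLCAxMDAwKQ==",
    "base64//dG1wZGlyPXN5c3RlbV90bXBkaXI=",
    "base64//c2x1cm1fcGFydGl0aW9uPWFjb21waWxl",
    "base64//c2x1cm1fYWNjb3VudD1hbWMtZ2VuZXJhbA==",
]

for line in decode_resource_args(set_resources + default_resources):
    print(line)
```

The same decoding works for the identical arguments that appear in the job's own log.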

And here is the output from the job:

Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: True
Using shell: /bin/bash
Provided remote nodes: 1
Provided resources: mem_mb=15259, mem_mib=7630, disk_mb=43311, disk_mib=41305
Resources before job selection: {'mem_mb': 15259, 'mem_mib': 7630, 'disk_mb': 43311, 'disk_mib': 41305, '_cores': 9223372036854775807, '_nodes': 1}
Ready jobs (1)
Select jobs to execute...
Using greedy selector because only single job has to be scheduled.
Selected jobs (1)
Resources after job selection: {'mem_mb': 15259, 'mem_mib': 0, 'disk_mb': 43311, 'disk_mib': 0, '_cores': 9223372036854775806, '_nodes': 0}
Execute 1 jobs...

[Mon Apr 22 08:50:04 2024]
rule fastqc:
    input: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R1.fastq.gz, /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R2.fastq.gz
    output: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/fastqc_control_1_summary_untrimmed.txt
    jobid: 0
    reason: Forced execution
    wildcards: results=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results, sample=control_1
    resources: mem_mb=15259, mem_mib=7630, disk_mb=43311, disk_mib=41305, tmpdir=<TBD>, slurm_partition=acompile, slurm_account=amc-general, slurm_extra=--output=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/logs/fastqc_pre_trim/fastqc_control_1_untrimmed.out --error=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/logs/fastqc_pre_trim/fastqc_control_1_untrimmed.err --qos=compile, runtime=60, mem=16GB

General args: ['--force', '--target-files-omit-workdir-adjustment', '--keep-storage-local-copies', '--max-inventory-time 0', '--nocolor', '--notemp', '--no-hooks', '--nolock', '--ignore-incomplete', '', '--verbose ', '--rerun-triggers code mtime input params software-env', '', '', '--deployment-method apptainer', '--conda-frontend mamba', '', '', '--apptainer-prefix /scratch/alpine/kwellswrasman@xsede.org/apptainer_cache', '', '', '--shared-fs-usage sources source-cache software-deployment input-output persistence storage-local-copies', '', '--wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/', '', '', '--configfiles /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/config.yaml', '', '', '--latency-wait 60', '--scheduler ilp', '--local-storage-prefix .snakemake/storage', '--scheduler-solver-path /projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/bin', '', '', '--set-resources base64//ZmFzdHFjOnJ1bnRpbWU9NjA= base64//ZmFzdHFjOm1lbT0xNkdC base64//ZmFzdHFjX3N1bW1hcnk6cnVudGltZT0xMA== base64//ZmFzdHFjX3N1bW1hcnk6bWVtPTRHQg==', '', '', '--default-resources base64//bWVtX21iPW1pbihtYXgoMippbnB1dC5zaXplX21iLCAxMDAwKSwgODAwMCk= base64//ZGlza19tYj1tYXgoMippbnB1dC5zaXplX21iLCAxMDAwKQ== base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= base64//c2x1cm1fcGFydGl0aW9uPWFjb21waWxl base64//c2x1cm1fYWNjb3VudD1hbWMtZ2VuZXJhbA==']
This job is a group job: False
The call for this job is: srun -n1 --cpu-bind=q --cpus-per-task 1 /projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/bin/python3.12 -m snakemake --snakefile /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/Snakefile --target-jobs 'fastqc:results=/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results,sample=control_1' --allowed-rules 'fastqc' --cores all --attempt 1 --force-use-threads  --resources 'mem_mb=15259' 'mem_mib=7630' 'disk_mb=43311' 'disk_mib=41305' --force --target-files-omit-workdir-adjustment --keep-storage-local-copies --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --verbose  --rerun-triggers code mtime input params software-env --deployment-method apptainer --conda-frontend mamba --apptainer-prefix /scratch/alpine/kwellswrasman@xsede.org/apptainer_cache --shared-fs-usage sources source-cache software-deployment input-output persistence storage-local-copies --wrapper-prefix https://github.com/snakemake/snakemake-wrappers/raw/ --configfiles /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/config.yaml --latency-wait 60 --scheduler ilp --local-storage-prefix .snakemake/storage --scheduler-solver-path /projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/bin --set-resources base64//ZmFzdHFjOnJ1bnRpbWU9NjA= base64//ZmFzdHFjOm1lbT0xNkdC base64//ZmFzdHFjX3N1bW1hcnk6cnVudGltZT0xMA== base64//ZmFzdHFjX3N1bW1hcnk6bWVtPTRHQg== --default-resources base64//bWVtX21iPW1pbihtYXgoMippbnB1dC5zaXplX21iLCAxMDAwKSwgODAwMCk= base64//ZGlza19tYj1tYXgoMippbnB1dC5zaXplX21iLCAxMDAwKQ== base64//dG1wZGlyPXN5c3RlbV90bXBkaXI= base64//c2x1cm1fcGFydGl0aW9uPWFjb21waWxl base64//c2x1cm1fYWNjb3VudD1hbWMtZ2VuZXJhbA== --mode remote
Job is running on host: c3cpu-a2-u32-1.rc.int.colorado.edu
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.
[Mon Apr 22 08:50:04 2024]
Error in rule fastqc:
    jobid: 0
    input: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R1.fastq.gz, /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R2.fastq.gz
    output: /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/fastqc_control_1_summary_untrimmed.txt
    shell:
        
        mkdir -p /scratch/alpine/kwellswrasman@xsede.org/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim
        fastqc /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R1.fastq.gz /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/raw_data/control_1_R2.fastq.gz --outdir /scratch/alpine/kwellswrasman@xsede.org/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim
        for dir in /scratch/alpine/kwellswrasman@xsede.org/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/control_1_R1_fastqc.zip /scratch/alpine/kwellswrasman@xsede.org/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/control_1_R2_fastqc.zip;
        do
            name=$(basename -s .zip $dir)

            unzip -p $dir $name/summary.txt                 >> /pl/active/Anschutz_BDC/analysis/wells/analysis/wells/testing_snakemake8_slurm/results/fastqc_pre_trim/fastqc_control_1_summary_untrimmed.txt
        done
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Storing output in storage.
Full Traceback (most recent call last):
  File "/projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/lib/python3.12/site-packages/snakemake/cli.py", line 2068, in args_to_api
    dag_api.execute_workflow(
  File "/projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/lib/python3.12/site-packages/snakemake/api.py", line 589, in execute_workflow
    workflow.execute(
  File "/projects/kwellswrasman@xsede.org/software/anaconda/envs/snakemake8/lib/python3.12/site-packages/snakemake/workflow.py", line 1285, in execute
    raise WorkflowError("At least one job did not complete successfully.")
snakemake_interface_common.exceptions.WorkflowError: At least one job did not complete successfully.

WorkflowError:
At least one job did not complete successfully.

PS: Would you be interested in contributing your workflows to the snakemake-workflows catalogue (see https://snakemake.github.io/snakemake-workflow-catalog/)? Some of yours look pretty interesting!

I could definitely do that! I'll add it to my todo list!

Bad news: I cannot reproduce this behaviour. Edit: my SLURM version is 23.02.7.

I noticed that you are overwriting --output via slurm_extra. That does not produce the error. Still, note that we write our own log file to os.path.abspath(f".snakemake/slurm_logs/{group_or_rule}/{wildcard_str}/%j.log"), as reported by the plugin.
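The path expression quoted here can be evaluated on its own to see where the plugin's log lands. This is a minimal sketch: the helper name slurm_log_path is hypothetical, and the group_or_rule and wildcard_str values are copied from the sbatch --output argument in the verbose log rather than computed the way the plugin computes them. Because the results wildcard holds an absolute path, the log ends up in a deeply nested directory under .snakemake/slurm_logs, separate from the --output file passed through slurm_extra.

```python
import os


def slurm_log_path(group_or_rule: str, wildcard_str: str) -> str:
    """Evaluate the log-path expression quoted in this thread."""
    return os.path.abspath(
        f".snakemake/slurm_logs/{group_or_rule}/{wildcard_str}/%j.log"
    )


# Values read off the sbatch --output argument in the verbose log
wildcard_str = (
    "_/pl/active/Anschutz_BDC/analysis/wells/analysis/wells/"
    "testing_snakemake8_slurm/results_control_1"
)
path = slurm_log_path("rule_fastqc", wildcard_str)
print(path)  # absolute path under the current working directory
```

SLURM itself substitutes %j with the job ID when it opens the file, which is why the logged path ends in 5783166.log.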

My Snakefile is

rule all:
    input: "results/2.out"

rule test1:
    output: "results/2.out"
    #threads: 2
    resources:
        cpus_per_task=2,
        slurm_extra="--output='somewhere_%j.log'"
    shell: "touch results/$SLURM_CPUS_PER_TASK.out"

My profile:

default-resources:
    slurm_partition: "smallcpu"
    slurm_account: "nhr-zdvhpc" #"m2_zdvhpc"

set-resources:
    test1:
        runtime: 5
        mem_mb: 1800

Does this produce the observed error, too?

It's unfortunate that you can't reproduce it.

You are correct: using your profile (with the partition and account changed) and your Snakefile, I get the same error. However, the error still occurs when I remove the slurm_extra argument, so it doesn't seem to come from overwriting --output.

... I get the same error.

That is not what I wanted to read ;-)

Assuming you have this script:

#!/bin/bash

#SBATCH --mem 100
#SBATCH -A amc-general 
#SBATCH -p acompile
#SBATCH -t 5

srun echo "Hello world"

and you run sbatch <this script>, does your SLURM output contain the error, too? After all, we observe that the srun call in the jobstep executor does NOT include any memory setting, and yet you still see this error.
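Since the srun call itself carries no memory option, the conflict most likely comes from the environment that srun inherits, and the error message names the three variables involved. Here is a minimal diagnostic sketch, assuming it is run as a plain python3 command in the batch script before the srun line. It only reports which of those variables are set and changes nothing; the function name is hypothetical.

```python
import os

# The three variables named in the srun error message
MEM_VARS = ("SLURM_MEM_PER_CPU", "SLURM_MEM_PER_GPU", "SLURM_MEM_PER_NODE")


def conflicting_mem_vars(env=None) -> dict:
    """Return the SLURM_MEM_PER_* variables set in env (default: os.environ).

    Judging by the error message, srun refuses to start when more than one
    of them is set, so two or more entries point at the conflict.
    """
    env = os.environ if env is None else env
    return {name: env[name] for name in MEM_VARS if name in env}


found = conflicting_mem_vars()
if len(found) > 1:
    print("srun will likely refuse to start; conflicting settings:", found)
else:
    print("no conflicting memory variables:", found)
```

Running the same check on a node where the test job succeeds would show which variable is extra on the failing setup.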

Good call: that produced the exact same error. It seems to be an issue with my system and not Snakemake (probably what you did want to hear!)

I'll reach out to our system administrators. Thank you so much for all of your help!

probably what you did want to hear

Not really. It is some sort of relief, though. I know that updating SLURM takes effort when my colleagues are bitten by a bug; then again, I would be surprised if you were the first to report this.

Thanks for the feedback. I will keep this issue open, if you don't mind, and await further feedback. Perhaps it turns out to be a corner case we can mitigate.

Sounds great. We are working on it and have so far figured out that this works:

#!/bin/bash

#SBATCH --mem 100
#SBATCH -A amc-general 
#SBATCH -p acompile
#SBATCH -t 5

srun --mem 100 echo "Hello world"

I will let you know if we make any progress.

Urgh, is redundancy a new hobby of SchedMD, or is there a technical reason behind it (just a rhetorical question!)? Before I contribute this duplication to the code, I need to check a couple of SLURM versions (read: two, for I do not have access to more, and I will ask colleagues to do the same). I am not sure whether or where there might be side effects.

Also, as “my” most current version of SLURM is slightly more up to date than yours, I have to presume that this is a quirk of your cluster.
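The duplication under discussion amounts to repeating the job's memory request on the srun line, as in the working test script (srun --mem 100 ...). Here is a minimal Python sketch of such a command builder. The helper build_srun_args is hypothetical and is not the plugin's code; its base flags are copied from the jobstep srun call in the log, and mem_mb would come from the job's resources.

```python
import shlex


def build_srun_args(cpus_per_task: int, mem_mb=None) -> list:
    """Sketch of an srun prefix that optionally repeats the memory request.

    The base flags mirror the jobstep call shown in the log; adding --mem
    mirrors the workaround that worked on the affected cluster.
    """
    args = ["srun", "-n1", "--cpu-bind=q", "--cpus-per-task", str(cpus_per_task)]
    if mem_mb is not None:
        args += ["--mem", str(mem_mb)]
    return args


print(shlex.join(build_srun_args(1, mem_mb=15259)))
```

Keeping the flag optional leaves the current behaviour unchanged when no memory value is passed, which limits possible side effects on other SLURM versions.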

This is likely a quirk of my cluster. We will definitely keep working on our side to see if there are good fixes.

Again, thanks so much for your help!

I might have found the problem... Our cluster is currently going through some growing pains, so the best way to get an interactive job is by starting an interactive vscode session. When I submit the snakemake jobs from within the interactive vscode session, I get the error, but I don't when submitting from a normal interactive node.

So the slurm integration seems to work well as long as I'm not running through vscode.

Ah, the issue is that you submit whilst working within a job context. I'm afraid that's not what we designed the plugin for. It should not be an issue either; at least, this particular issue of yours should not arise.

Now, we can certainly detect this and program a fat warning. I wonder, however, whether falling back on the actual SLURM executor instead of the jobstep executor is possible as a reaction. Either way, I will keep this issue open until I have an answer to this question.
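Such a check could be as simple as looking for a job ID in the environment. Here is a minimal sketch, assuming that SLURM_JOB_ID is present whenever the caller already runs inside a SLURM allocation (as the interactive vscode session above does). The function name is hypothetical, and the plugin may implement the detection differently or fall back to the SLURM executor instead, as suggested.

```python
import os
import warnings


def warn_if_inside_job(env=None) -> bool:
    """Warn when Snakemake is launched from inside an existing SLURM job.

    Assumes SLURM_JOB_ID is set for job allocations (batch or interactive),
    so its presence is taken as a sign of a job context.
    """
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOB_ID")
    if job_id is None:
        return False
    warnings.warn(
        f"Running inside SLURM job {job_id}; inherited SLURM_* variables may "
        "conflict with the resources requested for submitted jobs.",
        RuntimeWarning,
    )
    return True


warn_if_inside_job()
```

Returning a boolean leaves room for the caller to decide whether a warning is enough or whether to switch executors.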