Error in rule apply_quality_filter
- I checked and didn't find help in related issues, e.g. while typing the title
- I got an error in the following rule(s):
apply_quality_filter
- I checked the log files indicated in the error message (and the cluster logs if submitted to a cluster)
Error in rule apply_quality_filter:
jobid: 44
input: S221200000966/sequence_quality_control/S221200000966_deduplicated_R1.fastq.gz, S221200000966/sequence_quality_control/S221200000966_deduplicated_R2.fastq.gz, [...]dat/db/atlas/adapters.fa
output: S221200000966/sequence_quality_control/S221200000966_filtered_R1.fastq.gz, S221200000966/sequence_quality_control/S221200000966_filtered_R2.fastq.gz, S221200000966/sequence_quality_control/S221200000966_filtered_se.fastq.gz, S221200000966/logs/S221200000966_quality_filtering_stats.txt
log: S221200000966/logs/QC/quality_filter.log (check log file(s) for error details)
conda-env: [...]dat/db/atlas/conda_envs/baf452aa6f8992d0d74d69497829a56b_
shell:
bbduk.sh in1=S221200000966/sequence_quality_control/S221200000966_deduplicated_R1.fastq.gz in2=S221200000966/sequence_quality_control/S221200000966_deduplicated_R2.fastq.gz ref=[...]dat/db/atlas/adapters.fa interleaved=f out=S221200000966/sequence_quality_control/S221200000966_filtered_R1.fastq.gz out2=S221200000966/sequence_quality_control/S221200000966_filtered_R2.fastq.gz outs=S221200000966/sequence_quality_control/S221200000966_filtered_se.fastq.gz stats=S221200000966/logs/S221200000966_quality_filtering_stats.txt overwrite=true qout=33 trd=t hdist=1 k=27 ktrim=r mink=8 trimq=10 qtrim=rl threads=8 minlength=51 maxns=-1 minbasefrequency=0.05 ecco=t prealloc=t pigz=t unpigz=t -Xmx102G 2> S221200000966/logs/QC/quality_filter.log
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
cluster_jobid: 1525371
Here is the relevant log output:
java -ea -Xmx102G -Xms102G -cp [...]dat/db/atlas/conda_envs/baf452aa6f8992d0d74d69497829a56b_/opt/bbmap-39.01-1/current/ jgi.BBDuk in1=S221200000966/sequence_quality_control/S221200000966_deduplicated_R1.fastq.gz in2=S221200000966/sequence_quality_control/S221200000966_deduplicated_R2.fastq.gz ref=[...]dat/db/atlas/adapters.fa interleaved=f out=S221200000966/sequence_quality_control/S221200000966_filtered_R1.fastq.gz out2=S221200000966/sequence_quality_control/S221200000966_filtered_R2.fastq.gz outs=S221200000966/sequence_quality_control/S221200000966_filtered_se.fastq.gz stats=S221200000966/logs/S221200000966_quality_filtering_stats.txt overwrite=true qout=33 trd=t hdist=1 k=27 ktrim=r mink=8 trimq=10 qtrim=rl threads=8 minlength=51 maxns=-1 minbasefrequency=0.05 ecco=t prealloc=t pigz=t unpigz=t -Xmx102G
Executing jgi.BBDuk [in1=S221200000966/sequence_quality_control/S221200000966_deduplicated_R1.fastq.gz, in2=S221200000966/sequence_quality_control/S221200000966_deduplicated_R2.fastq.gz, ref=[...]dat/db/atlas/adapters.fa, interleaved=f, out=S221200000966/sequence_quality_control/S221200000966_filtered_R1.fastq.gz, out2=S221200000966/sequence_quality_control/S221200000966_filtered_R2.fastq.gz, outs=S221200000966/sequence_quality_control/S221200000966_filtered_se.fastq.gz, stats=S221200000966/logs/S221200000966_quality_filtering_stats.txt, overwrite=true, qout=33, trd=t, hdist=1, k=27, ktrim=r, mink=8, trimq=10, qtrim=rl, threads=8, minlength=51, maxns=-1, minbasefrequency=0.05, ecco=t, prealloc=t, pigz=t, unpigz=t, -Xmx102G]
Version 39.01
Set INTERLEAVED to false
Set threads to 8
Initial size set to 1014718593
maskMiddle was disabled because useShortKmers=true
[...]dat/db/atlas/conda_envs/baf452aa6f8992d0d74d69497829a56b_/bin/bbduk.sh: line 370: 2563011 Killed java -ea -Xmx102G -Xms102G -cp [...]dat/db/atlas/conda_envs/baf452aa6f8992d0d74d69497829a56b_/opt/bbmap-39.01-1/current/ jgi.BBDuk in1=S221200000966/sequence_quality_control/S221200000966_deduplicated_R1.fastq.gz in2=S221200000966/sequence_quality_control/S221200000966_deduplicated_R2.fastq.gz ref=[...]dat/db/atlas/adapters.fa interleaved=f out=S221200000966/sequence_quality_control/S221200000966_filtered_R1.fastq.gz out2=S221200000966/sequence_quality_control/S221200000966_filtered_R2.fastq.gz outs=S221200000966/sequence_quality_control/S221200000966_filtered_se.fastq.gz stats=S221200000966/logs/S221200000966_quality_filtering_stats.txt overwrite=true qout=33 trd=t hdist=1 k=27 ktrim=r mink=8 trimq=10 qtrim=rl threads=8 minlength=51 maxns=-1 minbasefrequency=0.05 ecco=t prealloc=t pigz=t unpigz=t -Xmx102G
job info
sacct -j 1525371
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- --------------------------------
1525371 apply_qua+ x base n165 00:01:04 OUT_OF_ME+ 0:125 billing=12,cpu=8,mem=18246M,nod+
1525371.batch batch n165 00:01:04 OUT_OF_ME+ 0:125 18556844K cpu=8,mem=18246M,node=1
1525371.extern extern n165 00:01:04 OUT_OF_ME+ 0:125 16K billing=12,cpu=8,mem=18246M,nod+
Atlas version: 2.16.2 (dev)
Additional context
- the problem is obviously too little memory
- I found a related issue #519
- increasing memory in config.yaml (mem=120, simplejob_mem=500) didn't help
How can I specifically increase the memory needed for apply_quality_filter jobs?
Can you increase mem even more? simplejob_mem can be less.
If you set mem to 120, bbmap gets 85% of it = 102.
If I set mem: 500 in the config.yaml, the error remains.
It doesn't seem to influence resource allocation at all:
sacct -j 1525720
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- --------------------------------
1525720 apply_qua+ x base n155 00:01:03 OUT_OF_ME+ 0:125 billing=10,cpu=8,mem=8885M,node+
1525720.batch batch n155 00:01:03 OUT_OF_ME+ 0:125 8595640K cpu=8,mem=8885M,node=1
1525720.extern extern n155 00:01:03 OUT_OF_ME+ 0:125 28K billing=10,cpu=8,mem=8885M,node+
OK, then there is a problem with the conversion when passing the memory to the cluster.
What cluster wrapper do you use, and which resource keyword does it accept?
As there was no clear standard for naming resources, each rule in atlas should have three resource attributes:
- mem (in GB)
- mem_mb
- mem_mib (the new standard, as far as I can see)
The only thing that can go wrong is to interpret mem as MB.
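For illustration, a rule carrying all three attributes looks roughly like this (schematic values, not a literal atlas rule):

```
rule example_with_all_resources:
    output:
        "results/example.txt",
    threads: 8
    resources:
        mem=10,            # GB, the atlas-level setting
        mem_mb=10 * 1000,  # what most cluster wrappers read
        mem_mib=9537,      # the newer Snakemake standard
    shell:
        "echo demo > {output}"
```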
Hi @SilasK,
I tried to investigate this a bit further.
A working rule:
rule get_read_stats:
input: S221200000963/sequence_quality_control/S221200000963_raw_R1.fastq.gz, S221200000963/sequence_quality_control/S221200000963_raw_R2.fastq.gz
output: S221200000963/sequence_quality_control/read_stats/raw.zip, S221200000963/sequence_quality_control/read_stats/raw_read_counts.tsv
log: S221200000963/logs/QC/read_stats/raw.log
jobid: 11
reason: Missing output files: S221200000963/sequence_quality_control/read_stats/raw.zip, S221200000963/sequence_quality_control/read_stats/raw_read_counts.tsv
wildcards: sample=S221200000963, step=raw
priority: 30
threads: 4
resources: mem_mb=9187, disk_mb=9187, disk_mib=8762, tmpdir=<TBD>, mem=10, java_mem=8, mem_mib=9187, time_min=300, runtime=300
sacct -j 1530622
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- --------------------------------
1530622 get_read_+ x base n107 00:02:30 COMPLETED 0:0 billing=6,cpu=4,mem=9187M,node=1
1530622.batch batch n107 00:02:30 COMPLETED 0:0 5216672K cpu=4,mem=9187M,node=1
1530622.extern extern n107 00:02:30 COMPLETED 0:0 132K billing=6,cpu=4,mem=9187M,node=1
A rule that does not work (deduplicate_reads throws the same error as apply_quality_filter):
rule deduplicate_reads:
input: S221200000963/sequence_quality_control/S221200000963_raw_R1.fastq.gz, S221200000963/sequence_quality_control/S221200000963_raw_R2.fastq.gz
output: S221200000963/sequence_quality_control/S221200000963_deduplicated_R1.fastq.gz, S221200000963/sequence_quality_control/S221200000963_deduplicated_R2.fastq.gz
log: S221200000963/logs/QC/deduplicate.log
jobid: 6
benchmark: logs/benchmarks/QC/deduplicate/S221200000963.txt
reason: Missing output files: S221200000963/sequence_quality_control/S221200000963_deduplicated_R2.fastq.gz, S221200000963/sequence_quality_control/S221200000963_deduplicated_R1.fastq.gz
wildcards: sample=S221200000963
threads: 8
resources: mem_mb=9187, disk_mb=9187, disk_mib=8762, tmpdir=<TBD>, mem=200, java_mem=170, mem_mib=9187, time_min=300, runtime=300
sacct -j 1530620
JobID JobName User Partition NodeList Elapsed State ExitCode MaxRSS AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- --------------------------------
1530620 deduplica+ x base n193 00:01:18 OUT_OF_ME+ 0:125 billing=10,cpu=8,mem=9187M,node+
1530620.batch batch n193 00:01:18 OUT_OF_ME+ 0:125 9323056K cpu=8,mem=9187M,node=1
1530620.extern extern n193 00:01:18 OUT_OF_ME+ 0:125 80K billing=10,cpu=8,mem=9187M,node+
The difference between the two rules is that java_mem is calculated in one case using simplejob_mem (working) and in the other case using mem (not working), as defined in rules/qc.smk.
Somehow the memory allocation when submitting the job is incorrect: the resources for the failing job are resources: mem_mb=9187, disk_mb=9187, disk_mib=8762, tmpdir=<TBD>, mem=200, java_mem=170, mem_mib=9187, time_min=300, runtime=300 (rule deduplicate_reads, see above).
It seems the job asks for 9187M, but the Java memory for clumpify is much higher (mem * JAVA_MEM_FRACTION = 200 * 0.85 = 170 GB).
As far as I can see, this does not match. What do you think?
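To make the difference concrete, here is roughly what I mean (a sketch with my config values, not the literal rules/qc.smk code):

```
# Sketch with my config values, not the literal rules/qc.smk code.
JAVA_MEM_FRACTION = 0.85
config = {"simplejob_mem": 10, "mem": 200}  # GB values from my config.yaml

# get_read_stats (works): Java heap sized from simplejob_mem
simple_java_mem = int(config["simplejob_mem"] * JAVA_MEM_FRACTION)  # 8 GB

# deduplicate_reads (fails): Java heap sized from mem
dedup_java_mem = int(config["mem"] * JAVA_MEM_FRACTION)  # 170 GB

# Both jobs were submitted with mem_mb=9187 (~9 GB), so -Xmx170G cannot fit.
print(simple_java_mem, dedup_java_mem)
```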
Here is the code that defines all the different ways the resources are set:
https://github.com/metagenome-atlas/atlas/blob/master/workflow/Snakefile#L405C25-L405C25
It gets screwed up if mem_mb is defined by default.
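The idea of that section is roughly the following (a sketch only, not the exact code):

```
# Sketch of the idea behind the linked Snakefile section, not the exact code.
# It runs near the end of the Snakefile, after all rules are defined.
for r in workflow.rules:
    if "mem_mb" not in r.resources and "mem" in r.resources:
        # interpret mem as GB and derive the MB value for the scheduler
        r.resources["mem_mb"] = r.resources["mem"] * 1000
# If --default-resources already injects a mem_mb entry, this check never
# fires and the (much smaller) default value is what gets submitted.
```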
I guess your cluster profile somehow defines --default-resources, which makes this not work.
Which profile do you use? Do you use the slurm submission directly integrated into snakemake?
The best way to fix this is to deactivate the definition of default resources in your profile.
Alternatively it might work to set:
atlas run ... --default-resources mem=200 mem_mb=200000 mem_mib=200000 java_mem=170
Short follow-up here: using the resource parameters you provided seems to work!
I'm using the default cluster profile (atlas run all --profile cluster). As you suggested, there might be an issue with the resources variable on our cluster. I will reach out to our cluster support.
@SilasK I'm not sure if this merits opening an issue, but I thought I'd mention a problem I'm encountering with default resources + cluster execution that sounds quite similar to the one encountered by @jotech.
I'm using the slurm cluster profile available from the cookiecutter command suggested for use in the ATLAS docs (in the cluster execution section).
It seems default resources are overriding the resource parameter values set in the ATLAS rule definitions, although I'm not defining default resources on the command line or setting them elsewhere.
- For example, the initialize_qc rule sets mem=config["simplejob_mem"], which is 10 (GB) in the config.yaml. From what I can tell, this mem parameter is converted to mem_mb near the end of the ATLAS Snakefile IF mem_mb is not listed in the job resources. However, snakemake submits the job with the following resources specified: resources: mem_mb=3992, disk_mb=3992, disk_mib=3808, tmpdir=<TBD>, mem=10, java_mem=8, mem_mib=3992, time_min=300, runtime=300. Clearly the mem parameter isn't being converted to mem_mb; rather, default resource values are being applied from some source unknown to me.
I'm currently looking for a solution to prevent snakemake's default resources from being applied, other than the --default-resources mem_mb=None workaround.
EDIT:
I've added --default-resources mem_mb=None disk_mb=None to my ATLAS call, which removed the limiting default values that were killing my jobs.
However, with a bit of inspection-via-print() in the Snakefile section where mem is converted to mem_mb, r.resources["mem_mb"] exists, and therefore the if statement is not entered. The value of r.resources["mem_mb"] is some DefaultResources function, from what I can tell (it prints as <function DefaultResources.__init__.<locals>.fallback.<locals>.callable at 0x7f34aba8cd30>).
In the snakemake output at rule submission, it lists no value for mem_mb anymore (it's not setting it to a default value; that's good), and mem is set to the corresponding value in the config.yaml file. The mem value is not being converted to mem_mb as I believe is intended in the abovementioned section of the Snakefile.
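One idea would be to treat a callable mem_mb like an unset one so the conversion still runs (untested sketch):

```
# Untested sketch: a callable mem_mb (the DefaultResources fallback printed
# above) is treated the same as "mem_mb not set", so mem still gets converted.
for r in workflow.rules:
    has_real_mem_mb = "mem_mb" in r.resources and not callable(r.resources["mem_mb"])
    if not has_real_mem_mb and "mem" in r.resources:
        r.resources["mem_mb"] = r.resources["mem"] * 1000
```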
In my investigation I've noticed that many rules set the mem parameter, but some set mem_mb directly. I've also noticed that some rules set mem_mb to a config value * 1000, essentially converting the config GB value to MB (e.g. rule download_gunc), whereas other rules simply set mem_mb equal to the config GB value, without multiplying to convert GB to MB (e.g. rule run_gunc). Is there a pattern to these different methods of setting mem_mb? Why not set mem_mb=config.<appropriate mem> in the rule resources across the board?
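For illustration, the two patterns I mean look roughly like this (schematic only, not the literal gunc rules):

```
# Schematic of the two patterns, not the literal gunc rules.
gb_value = 10  # a memory value from config.yaml, given in GB

# pattern A (e.g. download_gunc): the GB value is converted to MB
mem_mb_a = gb_value * 1000  # 10000 -> the scheduler reserves ~10 GB

# pattern B (e.g. run_gunc): the GB value is passed through unchanged
mem_mb_b = gb_value         # 10 -> the scheduler would reserve only 10 MB
```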