GATB/MindTheGap

Extremely Large Run-Time in 'Contig-Fill' Mode


Hi, I'm running MindTheGap version 2.2.2 in the contig gap-filling mode. I gave it 28 threads to run off of and started it on May 27. I logged into my computer to check for any updates, and it estimated a remaining time of 115,085 minutes (just under 80 days). Is this what should be expected?

As a note - I am threading this process, not parallelizing it.

Hi,

Indeed, this is an extremely large running time estimate!
The running time depends mainly on the number of gap-fillings to perform (determined by the number of contigs in the input contig file) and on the complexity of the de Bruijn graph, together with the parameters controlling its exploration (-max-nodes and -max-length).

Could you indicate how many contigs you have in your input contig file? Did you use any non-default parameter values?
Also, what type of genome are you trying to gap-fill, and from what type of sequencing data?

Claire

Hi Claire,

Thanks for creating such a cool tool! I want to follow up on this as I am encountering a similar issue and hope this might help future users. Currently I am working with a single metagenomic water sample. I am testing it through Snakemake and am loading a single conda environment instance for submitting this single run. Please let me know if there is anything else I could clarify or add.

Here are some summary stats:

--- [STAT] 38235 contigs, total 116895963 bp, min 1000 bp, max 97335 bp, avg 3057 bp, N50 4204 bp

Snakemake rule (I have been a little unclear if these are the available resources, I had requested 400GB RAM, but am still clarifying with our sys admin if I'm actually receiving that much):

rule mindthegap_fill_C_M6:
input: C_M6_1.trimmed.fastq.gz, C_M6_2.trimmed.fastq.gz, C_M6.contigs.fa
output: C_M6.insertions.fa
log: C_M6.fill.log
jobid: 26
resources: mem_mb=5773, disk_mb=5773, tmpdir=/tmp

Here are my input parameters. I set -nb-cores, -max-disk, and -max-memory to zero (assuming that this will use everything available within the requested SLURM queue):

MindTheGap fill -in {input.reads_1},{input.reads_2} -contig {input.contigs} -out {params.sample} -nb-cores 0 -max-disk 0 -max-memory 0 -max-nodes 'default' -max-length 'default'

And some information from my log, where each percent of the fill step takes between 30 and 40 minutes:

[Graph: nb branching found : 1707875 ] 100 % elapsed: 0 min 13 sec remaining: 0 min 0 sec cpu: 787.9 % mem: [1173, 1173, 3843] MB
[Filling the contigs ] 0 % elapsed: 0 min 0 sec remaining: 0 min 0 sec
[Filling the contigs ] 1 % elapsed: 37 min 19 sec remaining: 3693 min 51 sec

Avery

Hi Avery,

thanks for using MindTheGap.

Given the large number of gap-fillings to perform (38,000 contigs implies almost 80,000 gap-fillings), the observed running time may not be unusual.
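As a rough check of these numbers, here is a minimal sketch that extrapolates from the log above. It assumes two gap-fillings per contig (which is what the "almost 80,000" figure suggests) and that progress is linear in time; the function name is only illustrative and is not part of MindTheGap.

```python
def estimate_total_seconds(elapsed_seconds, percent_done):
    """Linearly extrapolate the total run time from the progress so far."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return elapsed_seconds * 100 / percent_done


contigs = 38235            # from the [STAT] line in the log
fills_per_contig = 2       # assumption: two gap-fillings per contig
total_fills = contigs * fills_per_contig

elapsed = 37 * 60 + 19     # "1 % elapsed: 37 min 19 sec"
total_seconds = estimate_total_seconds(elapsed, percent_done=1)
total_hours = total_seconds / 3600
seconds_per_fill = total_seconds / total_fills

print(f"gap-fillings to perform: {total_fills}")
print(f"estimated total run time: {total_hours:.1f} h")
print(f"average time per gap-filling: {seconds_per_fill:.2f} s")
```

Under these assumptions the whole run comes to roughly two and a half days, or a few seconds per gap-filling on average.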

The total running time grows linearly with the number of input contigs. So, one efficient way to reduce the running time is to reduce the number of input contigs. You can try to select the contigs based for instance on their size or read coverage.
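Selecting contigs by size could look like the sketch below. It is plain Python with no dependency on MindTheGap; the function names and the length threshold are illustrative assumptions, and filtering on read coverage would need additional information that is not stored in the FASTA file.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(in_handle, out_handle, min_length):
    """Write contigs of at least min_length bp; return (kept, total)."""
    kept = total = 0
    for header, seq in read_fasta(in_handle):
        total += 1
        if len(seq) >= min_length:
            out_handle.write(f">{header}\n{seq}\n")
            kept += 1
    return kept, total


# Small in-memory demonstration; real use would pass open file objects.
demo_in = io.StringIO(">c1\nACGT\nACGT\n>c2\nAC\n>c3\nACGTAC\n")
demo_out = io.StringIO()
print(filter_contigs(demo_in, demo_out, min_length=5))  # (2, 3)
print(demo_out.getvalue(), end="")
```

For a real contig file, open the input and output with `open(...)` and pick a threshold suited to the assembly (for example, a few thousand bp); the filtered file is then passed to -contig.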

The running time per gap-fill (or contig) depends mainly on the two parameters limiting the de Bruijn graph exploration (-max-nodes and -max-length). I am not familiar with Snakemake. Are you sure they are set to their default values (100 and 10000 respectively)? You can also try reducing these values (for instance, to 50 and 5000).
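One way to be sure which values are used is to pass explicit numbers rather than a placeholder string. The sketch below only builds and prints the command line; it uses the options already shown in this thread (-in, -contig, -out, -max-nodes, -max-length), and the helper name and output prefix are illustrative assumptions.

```python
import shlex


def build_fill_command(reads, contigs, out_prefix,
                       max_nodes=100, max_length=10000):
    """Return the MindTheGap fill command as an argument list.

    The defaults (100 and 10000) are the values quoted in this thread.
    """
    return [
        "MindTheGap", "fill",
        "-in", ",".join(reads),
        "-contig", contigs,
        "-out", out_prefix,
        "-max-nodes", str(max_nodes),
        "-max-length", str(max_length),
    ]


cmd = build_fill_command(
    reads=["C_M6_1.trimmed.fastq.gz", "C_M6_2.trimmed.fastq.gz"],
    contigs="C_M6.contigs.fa",
    out_prefix="C_M6",   # hypothetical output prefix
    max_nodes=50,        # reduced from the default of 100
    max_length=5000,     # reduced from the default of 10000
)
print(shlex.join(cmd))
```

The printed line can be pasted into the Snakemake rule's shell section, or the list can be handed to `subprocess.run` directly.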

I hope this helps,
best,
Claire

Hi Claire,

Thanks for your response! This was exactly the issue I was having.

My solution was to extract only the input contigs assigned to our phyla of interest and run MindTheGap on this reduced set. This significantly reduced the runtime.

Hi @tuck82er,

Glad it helped, and I hope you will get interesting results with MindTheGap.

In case you did not know, we have also developed a full pipeline to assemble specific genomes of interest from metagenomic samples. It is called MinYS, and it includes MindTheGap for the gap-filling between contigs. Even if you do not use the full pipeline, you may be interested in its last step, which simplifies the output genome graph: there are post-processing scripts to run after gap-filling, for instance to remove the redundancy due to gap-fillings in both forward and reverse directions, or to convert the GFA file into a FASTA file. (See also the paper, which includes applications to some symbiotic communities.)

Best,
Claire