marbl/MHAP

MHAP running time

Closed this issue · 13 comments

Hello,
I would like to ask whether there is a way to estimate the running time of MHAP in Celera Assembler on a single machine. I have 14X coverage of 2 Gbases in PacBio reads and a machine with 80 CPUs and 1 TB of memory.
Does the ovlRefBlockSize parameter have an impact on speed?

Thank you

The running time is dependent on the number/length of sequences as well as your machine clock speed and I/O so it's difficult to estimate exact runtimes for your system. You can see some recent runtimes with CA 8.3 (the overlap time is MHAP) on the release notes page:
http://wgs-assembler.sourceforge.net/wiki/index.php/Version_8.3_Release_Notes

A human 70X dataset takes on the order of 8000 CPU hours, so on your system it would take about 4-5 days using all cores. However, I assume you're trying to run self-correction and assembly on your dataset. Having only 14X is quite low and you will likely not get a great assembly. You would also need to use the adjusted parameters for lower coverage, in which case the most expensive part of the computation won't be MHAP's overlapping but the consensus for the corrected sequences.

ovlRefBlockSize controls how large each block is and thus how much memory the program uses while running. Having fewer blocks is faster, but the recommended option is to set the ovlMemory parameter instead, as it will automatically adjust ovlRefBlockSize to fit the specified memory. I'd suggest reading the wiki page for the self-correction and assembly pipeline to get more info on recommended parameters:
http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR

For your system, I'd set the ovlMemory to 128, threads to 8, and ovlConcurrency to 10.
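
The arithmetic behind the estimate and the suggested settings can be sketched in a few lines of Python. This is only a rough illustration: it assumes perfect scaling across cores (real runs are also limited by I/O), and the helper names are hypothetical, not part of MHAP or Celera Assembler.

```python
def wall_clock_days(cpu_hours: float, cores: int) -> float:
    """Convert a CPU-hour budget into wall-clock days, assuming perfect scaling."""
    return cpu_hours / cores / 24.0


def cores_in_use(threads_per_job: int, concurrent_jobs: int) -> int:
    """Cores occupied when several multi-threaded jobs run side by side."""
    return threads_per_job * concurrent_jobs


# Figures quoted in the thread: ~8000 CPU hours for human 70X, 80 cores.
days = wall_clock_days(8000, 80)

# Suggested settings: 8 threads per job, 10 concurrent overlap jobs.
used = cores_in_use(8, 10)

print(f"about {days:.1f} days at perfect scaling")
print(f"{used} of 80 cores in use")
```

With these numbers the lower bound is a little over 4 days, which matches the 4-5 day estimate, and 8 threads times 10 concurrent jobs fills all 80 cores.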

Thank you, Sergey, for this nice explanation. I should mention that my genome size is about 2 Gbases. I did a quick prior hybrid correction with Illumina reads using the Lordec tool. The output reads showed low identity (~75% on average), so I adjusted the parameters for the PBcR pipeline as was done for the human reads (average identity 82.98%) in your paper "Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing":
asmOvlErrorRate=0.10
asmUtgErrorRate=0.07
asmCgwErrorRate=0.10
asmCnsErrorRate=0.10
utgGraphErrorRate = 0.07
utgGraphErrorLimit = 3.25
utgMergeErrorRate = 0.0825
utgMergeErrorLimit = 5.25
asmOBT = 0
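
For checking settings like these before a run, the list can be read into a dictionary. The following is a minimal Python sketch that assumes only the plain key=value syntax shown above (with optional spaces around the equals sign); parse_spec is a hypothetical helper, not a Celera Assembler tool.

```python
SPEC = """\
asmOvlErrorRate=0.10
asmUtgErrorRate=0.07
asmCgwErrorRate=0.10
asmCnsErrorRate=0.10
utgGraphErrorRate = 0.07
utgGraphErrorLimit = 3.25
utgMergeErrorRate = 0.0825
utgMergeErrorLimit = 5.25
asmOBT = 0
"""


def parse_spec(text: str) -> dict[str, str]:
    """Parse key=value lines into a dict, ignoring blank or malformed lines."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


settings = parse_spec(SPEC)
for name, value in settings.items():
    print(f"{name:20s} {value}")
```

Values are kept as strings so they can be written back unchanged; convert with float() only where a numeric comparison is needed.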

However, I'm unsure whether I should later set the ovlMinLen value for the assembly to 40% of the average length of the corrected sequences, in order to get good assembly results with few misassemblies, or whether it is better in my case to keep this value smaller to boost assembly statistics.

I would follow the suggestions for low coverage on the wiki page I linked to earlier rather than the human settings. The human genome had 54X coverage, much higher than your genome. You don't need to adjust the ovlMinLen value, it will be set for you by the pipeline. However, as I said, a 14X assembly will necessarily not be that good because you will have very few long sequences to span the large repeats in your genome. You can see the plot of assembly continuity for the yeast genome on the release notes page (http://wgs-assembler.sourceforge.net/wiki/index.php/Version_8.3_Release_Notes) and you can see the continuity drops off significantly below 30X.

The identities you report seem strange: raw PacBio sequences should have higher than 75% identity on average (closer to 85-88%). The corrected reads for the human genome were over 99% accurate, and the low-coverage parameters generate slightly lower quality reads (closer to 97% identity) but still far over 90%.
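
The coverage comparison above is simple arithmetic and can be sketched as follows. This is an illustration only: the 30X figure is the point where continuity drops off in the yeast plot mentioned above, used here as a rough guide rather than a rule for every genome, and the function names are hypothetical.

```python
def coverage(total_bases: float, genome_size: float) -> float:
    """Sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def bases_to_reach(target_cov: float, total_bases: float, genome_size: float) -> float:
    """Additional bases needed to reach a target coverage (0 if already there)."""
    return max(0.0, target_cov * genome_size - total_bases)


GENOME_SIZE = 2e9    # about 2 Gbases, as stated in the thread
TOTAL_BASES = 28e9   # 14X of a 2 Gbase genome

cov = coverage(TOTAL_BASES, GENOME_SIZE)
extra = bases_to_reach(30.0, TOTAL_BASES, GENOME_SIZE)

print(f"current coverage: {cov:.0f}X")
print(f"extra bases needed for 30X: {extra / 1e9:.0f} Gbases")
```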

I agree with you. In fact, the 75% average identity was estimated based on the
number of bases corrected by Lordec, so the actual read identity should be higher.

However, would you advise me to disable asmOBT (=0) to avoid trimming and
thus generate longer reads, or to keep it enabled (=1) as shown on the
wiki page?

The wiki is always the most updated guidance so I'd leave asmOBT on.

Thank you. Another question :) Do you think that combining hybrid Lordec
correction with MHAP self-correction could be a good strategy? The problem is
that with Lordec I lose the quality value information, which is used by
PBcR.

You can assemble lordec corrected PacBio sequences with Celera Assembler, that is fine. You can see suggestions on how to run the assembly on the wiki page:
http://wgs-assembler.sourceforge.net/wiki/index.php/PBcR#Assembly_of_Corrected_Sequences

However, I would not run PBcR/MHAP on the corrected lordec data.

OK, thank you, Sergey, for all these nice explanations.

The wiki page doesn't mention ovlMinLen. Do I have to adjust it to 40% of the average read length, or will it be set automatically by the runCA pipeline? The default value is 40.

You don't need to adjust it, the default is fine.

Thank you. This is very important information for me, since I would then be
able to use my mate-pair reads (20 kb and 5 kb) along with the PacBio data for
the runCA assembly.

I would not recommend running Celera Assembler with Illumina mate pairs, it is computationally expensive and non-trivial to run it for assembly of high-coverage short read datasets for large genomes. Instead, I'd assemble just corrected PacBio data and use the mate-pairs for scaffolding the resulting contigs using a standalone scaffolder like OPERA-LG or SSPACE.

Ok. It is well noted :)
