yechengxi/DBG2OLC

Choose the best assembly

uceleste opened this issue · 2 comments

Dear Chengxi Ye,

First, I want to say that I'm new to this kind of process, so thank you in advance for your time and patience.
I have performed numerous assemblies with DBG2OLC (about forty!). I noticed that, in general, using more stringent conditions (i.e. increasing KmerCovTh, MinOverlap and AdaptiveTh) tends to improve the most important quality statistics (N50, average contig size, number of contigs, etc.). On the other hand, it also reduces the total number of assembled bases.
I'm assembling a 360 Mbp genome and I have obtained assemblies like these (simplifying):

(1) Number of assembled bases: 358,530,138; N50: 591,435
(2) Number of assembled bases: 324,113,093; N50: 822,237
(3) Number of assembled bases: 305,918,712; N50: 1,130,615

As you can see, the N50 of test (3) is much higher than that of test (1), but its number of assembled bases is well below the estimated genome size.
To get to the point, my questions are:
(a) What kinds of sequences are typically eliminated when the settings are made more stringent?
(b) How can I decide which of the assemblies I've obtained is the best? Is it better to stay close to the estimated genome size?

Thanks for any advice!

Yes. In my experience, the rule of thumb is to keep the assembly that is closest to your estimated genome size. The more stringent settings usually remove regions with lower support, because they share similar characteristics with sequencing errors.
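
As a rough illustration of that rule of thumb, here is a minimal Python sketch that ranks the three assemblies from the question by how close their total size is to the estimated genome size. It is not part of DBG2OLC; the function names `fraction_assembled` and `rank_by_size_closeness` are made up for this example, and the numbers are copied from the question (360 Mbp is the poster's estimate).

```python
# Hypothetical helper script, not part of DBG2OLC.
# All figures are copied from the question above.
ESTIMATED_GENOME_SIZE = 360_000_000  # 360 Mbp, the poster's estimate

assemblies = {
    "(1)": {"bases": 358_530_138, "n50": 591_435},
    "(2)": {"bases": 324_113_093, "n50": 822_237},
    "(3)": {"bases": 305_918_712, "n50": 1_130_615},
}


def fraction_assembled(bases, genome_size=ESTIMATED_GENOME_SIZE):
    """Return the assembled bases as a fraction of the estimated genome size."""
    return bases / genome_size


def rank_by_size_closeness(assemblies, genome_size=ESTIMATED_GENOME_SIZE):
    """Return assembly labels, closest to the estimated genome size first."""
    return sorted(
        assemblies,
        key=lambda name: abs(assemblies[name]["bases"] - genome_size),
    )


ranking = rank_by_size_closeness(assemblies)
for name in ranking:
    stats = assemblies[name]
    share = fraction_assembled(stats["bases"])
    print(f"{name}: {share:.1%} of estimated size, N50 = {stats['n50']:,}")
```

With these numbers, assembly (1) comes first: it covers roughly 99.6% of the estimated size, against about 90% for (2) and 85% for (3). The sketch deliberately ignores N50 when ranking; it only makes the size criterion explicit, so N50 and other statistics still need to be weighed separately.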

Thanks for the explanation!!