bcgsc/ntJoin

How will ntJoin handle gaps with Ns in a scaffold?

rajewski opened this issue · 2 comments

I've been assembling a genome using ABySS, LINKS, RAILS, Sealer and ntEdit and want to try ntJoin using a related species' reference to improve contiguity. My ABySS assembly has a lot of very small contigs (~150bp) so the N50 is low, but there are much larger contigs in the assembly. On the other hand, the scaffolded assembly has gaps that I wouldn't want filled with sequence from a distant reference.

Do you expect the results to be better if the preliminary ABySS assembly were fed directly to ntJoin, or would an assembly scaffolded with LINKS and RAILS be preferable?

Hi @rajewski,

I'd expect ntJoin to work well on either the baseline ABySS assembly or the assembly after scaffolding. I've tried using ntJoin with a human assembly with N50 as low as 19kb, and I saw that ntJoin was still able to scaffold that pretty well with the human reference.

That being said, I think doing the LINKS and RAILS scaffolding steps with Sealer/ntEdit would be a good idea to get the best possible result. That's mainly because pieces that are around or smaller than the window size (w) won't be incorporated into scaffolds by ntJoin because of the minimizer-based approach, so with the initial scaffolding you'd be maximizing the number of sequences that could potentially be placed. Also, the Sealer step to fill in gaps would further help things.
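
To get a rough feel for how much of an assembly falls at or below a given window size, something like the sketch below might help. It is not part of ntJoin: the helper names `parse_fasta` and `split_by_window` are made up for illustration, the value of `w` is whatever you plan to pass to ntJoin, and treating "length <= w" as the cutoff is a simplification of "around or smaller than the window size".

```python
from typing import Dict, Iterable, Tuple


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence name: sequence}."""
    seqs: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def split_by_window(
    seqs: Dict[str, str], w: int
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split sequence lengths into (at or below w, above w).

    Assumption: sequences no longer than w are the ones unlikely to be
    placed; the real cutoff depends on where minimizers happen to fall.
    """
    short = {n: len(s) for n, s in seqs.items() if len(s) <= w}
    longer = {n: len(s) for n, s in seqs.items() if len(s) > w}
    return short, longer


if __name__ == "__main__":
    # Toy input; in practice, pass an open file handle for the assembly FASTA.
    toy = [">contig1", "ACGT" * 50, ">contig2 tiny", "A" * 10, "C" * 5]
    short, longer = split_by_window(parse_fasta(toy), w=100)
    print(f"{len(short)} sequence(s) at or below w, {len(longer)} above")
```

Running it on the ABySS contigs and again on the LINKS/RAILS scaffolds would show how many more bases end up in sequences longer than w after the initial scaffolding.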

We also only scaffold together sequences from the target assembly. The reference is just used for the scaffolding evidence; none of the actual reference bases are incorporated into the output assembly. If there are gaps between scaffolded target sequences, we just add Ns (the size of the gap is estimated based on the reference).
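
As a toy illustration of that idea (not ntJoin's actual code), joining two target sequences with an N gap could look like the sketch below. The function name `join_with_gap` and the `min_gap` parameter are hypothetical, and how the gap estimate is derived from the reference is left out entirely; the estimate is simply taken as an input.

```python
def join_with_gap(left: str, right: str, gap_estimate: int, min_gap: int = 1) -> str:
    """Join two target sequences, padding the gap with Ns only.

    gap_estimate: gap size in bases, assumed to come from the reference.
    min_gap: hypothetical floor so that a zero or negative estimate still
             leaves a visible gap; this is an assumption for illustration.
    No reference bases are ever written to the output.
    """
    gap = max(gap_estimate, min_gap)
    return left + "N" * gap + right


if __name__ == "__main__":
    scaffold = join_with_gap("ACGT", "TTGA", gap_estimate=3)
    print(scaffold)  # ACGTNNNTTGA
```

The point is only that the output contains target bases and Ns, so a gap in the scaffolds stays a gap rather than being filled from the reference.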

I hope that helps, and thank you for your interest in ntJoin!
Lauren

Thanks a bunch!