Sequences with lots of Ns make things run 10x slower

Question

tseemann opened this issue 10 years ago · 3 comments

I've got this report for prokka which sounds like a minced bug:
tseemann/prokka#116

I'm guessing it finds a lot of repeats in those poly-N runs!

Do we need to mask long runs of any single base?

I have a sequence of around 100 kbp, but it is padded at both ends with Ns, so the total length of the sequence is 2.8 Mbp. Prokka gets stuck at "searching for CRISPR repeats"; it still finishes, but takes more than 10x as long as annotating a 2.8 Mbp sequence with no Ns.
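For anyone hitting this before a fix lands, here is a rough sketch of the masking idea in Python: locate long runs of a single base (including N) and only hand the stretches between them to the repeat finder. This is not how minced handles it; the function names `find_long_runs` and `split_outside_runs` and the `min_len=100` threshold are made up for illustration.

```python
import re


def find_long_runs(seq, min_len=100):
    """Return (start, end, base) for each run of one base at least min_len long.

    Coordinates are 0-based, end-exclusive. min_len is an arbitrary threshold.
    """
    # One letter followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1))
            for m in pattern.finditer(seq.upper())]


def split_outside_runs(seq, min_len=100):
    """Yield (offset, segment) for the stretches between long runs.

    The offset lets hits found in a segment be mapped back to the
    coordinates of the original sequence.
    """
    pos = 0
    for start, end, _base in find_long_runs(seq, min_len):
        if start > pos:
            yield pos, seq[pos:start]
        pos = end
    if pos < len(seq):
        yield pos, seq[pos:]


if __name__ == "__main__":
    # Toy stand-in for a short contig padded with Ns at both ends.
    padded = "N" * 50 + "ACGTACGT" + "N" * 50
    for offset, segment in split_outside_runs(padded, min_len=10):
        print(offset, segment)
```

Each segment could then be searched on its own, adding `offset` to any reported positions. Whether splitting or replacing the runs is the better choice depends on how the downstream tool treats contig boundaries.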

tseemann commented 10 years ago

ping

Answer 1 · 2015-06-18T06:14:38.000Z

Yes, the original code was designed to work on completed genomes, where long runs of Ns aren't a problem. I'll look into a fix for this.

Answer 2 · 2015-07-13T06:11:38.000Z

This should hopefully be fixed in the new version, 0.2.0.