Sequences with lots of Ns make things run 10x slower
tseemann opened this issue · 3 comments
tseemann commented
I've got this report for prokka which sounds like a minced bug:
tseemann/prokka#116
I'm guessing it finds a lot of repeats in those poly-N runs!
Need to mask long poly-runs of any base?
I have a sequence about 100 kbp in length, but it is padded at both ends with Ns, so the total length of the sequence is 2.8 Mbp. Prokka gets stuck at "searching for CRISPR repeats"; it still finishes, but takes >10x as long as annotating a 2.8 Mbp sequence with no Ns.
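To make the masking idea above concrete, here is a rough sketch of one way to find long single-base runs and search only the stretches between them. It is in Python purely for illustration and is not how minced handles this; the function names and the `min_run=50` threshold are hypothetical.

```python
from itertools import groupby


def find_long_runs(seq, min_run=50):
    """Return (start, end, base) for every run of one base with length >= min_run.

    Coordinates are 0-based and half-open. min_run=50 is an arbitrary,
    assumed threshold, not a value taken from minced or prokka.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_run:
            runs.append((pos, pos + length, base))
        pos += length
    return runs


def split_on_long_runs(seq, min_run=50):
    """Yield (offset, subsequence) for the stretches between long runs.

    A repeat finder could then be run on each piece separately, adding
    the offset back to any coordinates it reports.
    """
    start = 0
    for run_start, run_end, _base in find_long_runs(seq, min_run):
        if run_start > start:
            yield start, seq[start:run_start]
        start = run_end
    if start < len(seq):
        yield start, seq[start:]


if __name__ == "__main__":
    # Toy version of the reported case: a short sequence padded with Ns.
    padded = "N" * 100 + "ACGTACGT" + "N" * 100
    print(find_long_runs(padded))
    print(list(split_on_long_runs(padded)))
```

Splitting instead of overwriting the runs keeps the original coordinates recoverable through the offset, and it means the repeat search never sees the padding at all.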
ctSkennerton commented
Yes, the original code was designed to work on completed genomes, where long runs of Ns aren't a problem. I'll look into a fix for this.
tseemann commented
ping
ctSkennerton commented
This should hopefully be fixed in the new version, 0.2.0.