ctSkennerton/minced

Sequences with lots of Ns make things run 10x slower

tseemann opened this issue · 3 comments

I've got this report for prokka which sounds like a minced bug:
tseemann/prokka#116

I'm guessing it finds a lot of repeats in those poly-N runs!

Need to mask long poly-runs of any base?
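
For illustration only, here is a minimal Python sketch of what skipping long single-base runs before the repeat search could look like. It is not how minced handles this. The function names (`long_runs`, `searchable_segments`) and the `min_run` threshold of 50 are hypothetical choices made for the example, and the sequence is assumed to be a single string with no line breaks.

```python
import re


def long_runs(seq, min_run=50):
    """Return half-open (start, end) intervals of single-base runs.

    A run is the same character repeated at least `min_run` times.
    Matching is case-sensitive, so 'N' and 'n' count as different bases.
    `min_run` is an assumed threshold, not a value taken from minced.
    """
    # One captured character followed by at least (min_run - 1) copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return [(m.start(), m.end()) for m in pattern.finditer(seq)]


def searchable_segments(seq, min_run=50):
    """Yield (offset, subsequence) for the stretches between long runs."""
    pos = 0
    for start, end in long_runs(seq, min_run):
        if start > pos:
            yield pos, seq[pos:start]
        pos = end
    if pos < len(seq):
        yield pos, seq[pos:]


if __name__ == "__main__":
    # A short real sequence padded with Ns at both ends, as in the report.
    padded = "N" * 1000 + "ACGTACGTAC" + "N" * 1000
    for offset, segment in searchable_segments(padded):
        print(offset, segment)
```

Each segment carries its offset in the original sequence, so any repeat coordinates found within a segment could be shifted back by that offset when reporting.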

I have a sequence around 100 kbp in length, but padded at both ends with Ns so that the total length of the sequence is 2.8 Mbp. Prokka gets stuck at "searching for CRISPR repeats", and though it still finishes, it takes >10x as long as annotating a 2.8 Mbp sequence with no Ns.

Yes, the original code was designed to work on completed genomes, where long runs of Ns aren't a problem. I'll look into a fix for this.

ping

Should hopefully be fixed in the new version, 0.2.0.