COBS postprocessing slower than COBS
karel-brinda opened this issue · 2 comments
This is a note for the future.
When COBS reports too many matches (e.g., in large Illumina experiments, where short reads produce many matches), the post-processing of its output becomes the bottleneck, even though it is very simple: essentially just removing IDs and filtering matches based on the max best hits.
This is the filtering script: https://github.com/karel-brinda/mof-search/blob/08c28f8366ad35ffcb79f9953b3668494e47c38a/scripts/postprocess_cobs.py
Idea for the future: once the first match is rejected here, the remaining matches can be skipped directly, so there is no need to repeatedly extract #kmers: https://github.com/karel-brinda/mof-search/blob/08c28f8366ad35ffcb79f9953b3668494e47c38a/scripts/postprocess_cobs.py#L38
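
A rough sketch of that early-skip idea, not the actual code of `postprocess_cobs.py`. It assumes that each query starts with a header line beginning with `*`, that each match line is `<ref>\t<kmers>`, and that matches within a query are sorted by decreasing #kmers. The function name `filter_cobs_matches`, the `max_hits` parameter, and the "keep ties with the last kept hit" rule are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Iterator


def filter_cobs_matches(lines: Iterable[str], max_hits: int) -> Iterator[str]:
    """Keep the best matches per query and skip the rest without parsing them.

    Assumptions (not verified against the real script): header lines start
    with '*', match lines are '<ref>\\t<kmers>', and matches are sorted by
    decreasing k-mer count within each query.
    """
    kept = 0            # matches kept so far for the current query
    last_kmers = None   # k-mer count of the last kept match
    skipping = False    # True once the first match of a query was rejected

    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("*"):
            # New query: reset the per-query state, pass the header through.
            kept, last_kmers, skipping = 0, None, False
            yield line
            continue
        if skipping:
            # Sorted input: everything after the first rejection is worse,
            # so the k-mer count is never extracted for these lines.
            continue
        kmers = int(line.rsplit("\t", 1)[1])
        if kept >= max_hits and kmers < last_kmers:
            skipping = True
            continue
        kept += 1
        last_kmers = kmers
        yield line
```

With this structure, the per-line cost after the first rejection drops to a `startswith` check, which should matter most for queries with very long match lists. If the matches are not sorted, the early skip is not valid and every line still has to be parsed.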