Runtime much slower than minimap2 when using a higher segment length
baozg opened this issue · 4 comments
Hi all,
I am using Arabidopsis genomes to test PAF input for SyRI, but I found that wfmash runs much slower than minimap2 when -s is set higher.
System: CentOS 7
fasta:
- TAIR10 : https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_chromosome_files/TAIR10_chr_all.fas.gz
- Ler0: https://www.ebi.ac.uk/ena/browser/api/fasta/GCA_900660825.1?download=true&gzip=true
Command:
wfmash -t 32 -p 95 -s 1k TAIR10.fa.gz Ler0.fa.gz
Installed from conda (wfmash v0.8.2).
| | wfmash -s 1k | wfmash -s 10k | minimap2 -ax asm20 --eqx -t 5 |
|---|---|---|---|
| Time | 1:20 | 04:40.4 | 1:46 |
| CPU | 2091% | 752% | 341% |
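
Because the runs use different thread counts (-t 32 for wfmash, -t 5 for minimap2), wall-clock time alone is hard to compare. A rough sketch of turning the table into approximate CPU seconds follows. It assumes the Time row is in minutes:seconds and that the CPU row is average utilisation over the whole run; neither is stated in the post, and the helper names are made up for illustration.

```python
def to_seconds(clock: str) -> float:
    """Convert 'm:ss' or 'mm:ss.s' text to seconds (assumed format)."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


def cpu_seconds(clock: str, cpu_percent: float) -> float:
    """Approximate CPU time as wall-clock seconds times average utilisation."""
    return to_seconds(clock) * cpu_percent / 100.0


# (wall-clock time, CPU utilisation in percent) copied from the table above
runs = {
    "wfmash -s 1k": ("1:20", 2091),
    "wfmash -s 10k": ("04:40.4", 752),
    "minimap2 asm20": ("1:46", 341),
}

for name, (clock, pct) in runs.items():
    wall = to_seconds(clock)
    cpu = cpu_seconds(clock, pct)
    print(f"{name}: wall {wall:.1f} s, CPU about {cpu:.0f} s")
```

Under these assumptions, both wfmash runs use several times more CPU time than the minimap2 run, even though the -s 1k run finishes sooner in wall-clock terms.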
This is expected: a bigger -s forces the mappings to cover bigger structural variations, which makes the alignments harder.
For now, how should -s be set appropriately? -s no longer needs to exceed the length of large repeats.
Hard to say. When the sequences are short (length L), I use -s << L (for example, L / 5 or L / 10). With longer sequences, I think about structural variations or repeats, but I usually don't go beyond 50 kbp. For A. thaliana, I have a bit of old experience with -s 10k, which seemed a good tradeoff.
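
The rule of thumb above can be sketched as a small helper. This is only an illustration of the heuristic as described in this comment, not anything wfmash provides: the function name, the default divisor, and treating 50 kbp as a hard cap are all assumptions.

```python
def suggest_segment_length(seq_len_bp: int, divisor: int = 5, cap_bp: int = 50_000) -> int:
    """Hypothetical helper: pick a segment length well below the sequence length.

    Uses seq_len_bp / divisor (the comment suggests a divisor of 5 or 10) and
    never goes beyond cap_bp (the comment's "usually not beyond 50 kbp").
    """
    if seq_len_bp <= 0 or divisor <= 0:
        raise ValueError("seq_len_bp and divisor must be positive")
    return max(1, min(seq_len_bp // divisor, cap_bp))


# Short sequences: the result scales with the sequence length.
print(suggest_segment_length(20_000))              # L / 5
print(suggest_segment_length(20_000, divisor=10))  # L / 10

# Chromosome-scale sequences: the cap applies instead.
print(suggest_segment_length(30_000_000))
```

For chromosome-scale input the cap is only an upper bound; the comment reports that -s 10k worked well for A. thaliana, so a smaller value may still be the better choice there.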
Thanks for the explanation.