lh3/miniprot

Limit to reported alignments?


The latest version indexed a 22.5 Gb genome (1.75 million scaffolds) in 32 min using 16 cores and 99 Gb of RAM, and aligned a file of 51,751 proteins against the index in 31 min using 16 cores and 42 Gb of RAM. Thanks to @lh3 for the quick fixes!

The output GFF file reports multiple alignment positions for many proteins, which is expected given the abundance of pseudogenes in this assembly. However, the distribution of the number of alignment positions per protein appears to be truncated at 51: there are 2,513 proteins with 51 reported alignment positions, and none with more. Is this the expected behavior? In this assembly, it would not be unreasonable to see hundreds of alignment positions for some proteins.
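For anyone who wants to check for the same truncation in their own output, here is a minimal sketch of how the per-protein counts could be tallied. It assumes each alignment appears as one `mRNA` record in the GFF and that the protein name is the first token of a `Target=` attribute. The function names and the sample records are hypothetical and made up for illustration, not taken from the actual run.

```python
from collections import Counter


def count_alignments(gff_lines):
    """Count mRNA records per target protein in miniprot-style GFF lines.

    Assumption: the ninth column holds ';'-separated key=value attributes
    and 'Target=<protein> <start> <end>' names the aligned protein.
    """
    per_protein = Counter()
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "mRNA":
            continue  # one mRNA record is treated as one alignment position
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        target = attrs.get("Target")
        if target:
            per_protein[target.split()[0]] += 1
    return per_protein


def histogram(per_protein):
    """Map 'number of alignment positions' -> 'number of proteins'."""
    return Counter(per_protein.values())


# Synthetic records for illustration only.
SAMPLE = [
    "##gff-version 3",
    "\t".join(["scaf1", "miniprot", "mRNA", "100", "400", "500", "+", ".",
               "ID=MP000001;Rank=1;Target=protA 1 100"]),
    "\t".join(["scaf1", "miniprot", "CDS", "100", "400", "500", "+", "0",
               "Parent=MP000001;Target=protA 1 100"]),
    "\t".join(["scaf2", "miniprot", "mRNA", "10", "310", "450", "-", ".",
               "ID=MP000002;Rank=2;Target=protA 1 100"]),
    "\t".join(["scaf3", "miniprot", "mRNA", "5", "95", "120", "+", ".",
               "ID=MP000003;Rank=1;Target=protB 1 30"]),
]

if __name__ == "__main__":
    counts = count_alignments(SAMPLE)
    for n_positions, n_proteins in sorted(histogram(counts).items()):
        print(n_positions, n_proteins)
```

A pile-up of proteins at the largest count, with nothing above it, is the pattern described above.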

lh3 commented

Glad to know miniprot works on your 22 Gb fragmented assembly in reasonable time. Thanks for testing!

If you want to see more alignments, increase both -N and --outn to something like:

miniprot -N 1000 --outn=1000

-N controls how many hits miniprot evaluates internally; increasing it will make miniprot run slower. --outn controls how many hits are output and has little effect on performance.
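As a sketch of how the two options might fit into a complete invocation, the snippet below assembles a command line in Python without running it. The file names `genome.mpi` and `proteins.faa`, the helper name `miniprot_command`, and the choice of 16 threads are assumptions for illustration; only `-N` and `--outn` come from the suggestion above.

```python
import shlex


def miniprot_command(index, proteins, max_hits=1000, threads=16):
    """Build a miniprot argument list that raises both hit limits together.

    Hypothetical helper: it only assembles arguments and does not run miniprot.
    """
    return [
        "miniprot",
        "-t", str(threads),      # worker threads (assumed value)
        "-N", str(max_hits),     # hits evaluated internally
        f"--outn={max_hits}",    # hits written to the output
        "--gff",                 # GFF output, as in the original report
        index,
        proteins,
    ]


if __name__ == "__main__":
    # Placeholder file names; substitute your own index and protein FASTA.
    print(shlex.join(miniprot_command("genome.mpi", "proteins.faa")))
```

Setting both limits from one parameter keeps `--outn` from silently capping what a larger `-N` finds.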