Increased identity threshold (--pi) generates more match dots

Question

Closed this issue 5 years ago · 4 comments

Hi there,
I ran MashMap to compare the contigs of one species against the genome of another species. The dot plot showed more match dots with --pi 90 than with --pi 85 (all other parameters were identical). My understanding is that a higher --pi imposes a stricter constraint on matching, so the plot should show fewer match dots. Could you please explain why this happens?

[Dot plot: --pi 85]

[Dot plot: --pi 90]

Answer 1 · 2019-11-27T15:21:41.000Z

Hi, which filter mode (-f) are you using? You can try one-to-one mode to get the best mappings among these. You're right that --pi 90 should be stricter, but since MashMap only provides guarantees on recall and not on precision, there is always a risk of false positives (i.e., matches below the specified cutoff) in repetitive genomes.
Also, in this case, with --pi 90 the implementation switches to a sparser sketch, which could also increase false positives.
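
One way to compare runs on equal footing is to apply the same identity cutoff to each output file after the fact, for example filtering the --pi 85 output down to mappings with reported identity of at least 90 and comparing that with the --pi 90 output. The sketch below is only an illustration: it assumes the whitespace-delimited output layout with the estimated identity in the 10th column, and the helper names and sample lines are made up. Note that the reported identity is itself an estimate, so this does not prove which mappings are true false positives.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Mapping(NamedTuple):
    query: str
    q_start: int
    q_end: int
    strand: str
    target: str
    t_start: int
    t_end: int
    identity: float  # estimated identity, in percent


def parse_mappings(lines: Iterable[str]) -> List[Mapping]:
    """Parse mapping lines; ASSUMES identity is the 10th whitespace-separated field."""
    mappings = []
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated lines
        mappings.append(
            Mapping(
                query=fields[0],
                q_start=int(fields[2]),
                q_end=int(fields[3]),
                strand=fields[4],
                target=fields[5],
                t_start=int(fields[7]),
                t_end=int(fields[8]),
                identity=float(fields[9]),
            )
        )
    return mappings


def split_by_identity(
    mappings: List[Mapping], cutoff: float
) -> Tuple[List[Mapping], List[Mapping]]:
    """Return (kept, dropped) according to the reported identity estimate."""
    kept = [m for m in mappings if m.identity >= cutoff]
    dropped = [m for m in mappings if m.identity < cutoff]
    return kept, dropped


# Invented example lines, not real MashMap output.
example_lines = [
    "contig1 20000 0 4999 + chr1 500000 1000 5999 92.5",
    "contig1 20000 5000 9999 - chr2 400000 7000 11999 86.1",
    "contig2 15000 0 4999 + chr1 500000 90000 94999 90.0",
]
kept, dropped = split_by_identity(parse_mappings(example_lines), cutoff=90.0)
print(len(kept), "mappings at or above the cutoff;", len(dropped), "below")
```

In practice the lines would come from the output files of the two runs, e.g. `parse_mappings(open(path))` for each file.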

Answer 2 · 2019-11-28T15:32:52.000Z

@cjain7, thank you. I did set -f to one-to-one mode. I later tried --pi 95 and found that it gave the cleanest dot plot (among 85, 90 and 95). So it seems that the number of matching fragments does not change monotonically with the identity threshold. As you mentioned, the risk of false positives always exists, but shouldn't this risk be higher when --pi is lower?
And why do you say --pi 90 makes the implementation switch to a sparser sketch? Isn't it --kmer that affects which parts of the sequence are inspected?

[Dot plot: --pi 95]

Answer 3 · 2019-11-28T16:22:58.000Z

> And why do you say --pi 90 makes the implementation switch to a sparser sketch?

The minimizer sampling density in the implementation is auto-tuned based on the specified length and identity cutoffs. The intuition is that with a lower error rate in the alignments, it suffices to sample fewer k-mers to locate them. This is done mainly to improve runtime and memory usage.
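
To make "sampling density" concrete, here is a minimal sketch of generic window-based minimizer sampling. It is not MashMap's code and does not reproduce its auto-tuning rule; the window size `w`, the hash function, and all function names are assumptions chosen for illustration. It only shows that a larger window samples fewer k-mers, i.e. gives a sparser sketch.

```python
import hashlib
import random
from typing import List


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (hash choice is arbitrary here)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_positions(seq: str, k: int, w: int) -> List[int]:
    """Positions of k-mers that are the minimum-hash k-mer in some window of w k-mers."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # Leftmost k-mer with the smallest hash in this window.
        offset = min(range(w), key=window.__getitem__)
        picked.add(start + offset)
    return sorted(picked)


def sampling_density(seq: str, k: int, w: int) -> float:
    """Fraction of all k-mers in seq that are kept as minimizers."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w)) / n_kmers


# Random test sequence; a fixed seed keeps the run reproducible.
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(5000))

for w in (1, 10, 50):
    print(f"w={w:>2}  density={sampling_density(seq, k=16, w=w):.3f}")
```

With fewer sampled k-mers per region, each mapping decision rests on less evidence, which is one plausible reason a sparser sketch could admit more spurious matches in repetitive sequence.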

> but shouldn't this risk be higher when --pi is lower?

This is hard to say; it would really depend on the input data.

Answer 4 · 2019-11-28T19:01:26.000Z

Thank you for the prompt reply. Your explanation is very helpful.