marbl/MashMap

Mash distance?

Closed this issue · 3 comments

Hello Team,

It turns out that the mash equation in map_stats.hpp is a different one comapre with the fastANI/Mashmap2 code. The correct one should be:

float mash_dist = (-1.0 / k) * log(2.0 * j/(1+j) );

while it is this in mashmap3:

float mash_dist = 1 - std::pow(2*j / (1+j), 1.0/k);

They are not the same to me. Is this on purpose or something because the original mash equation should be the first one.

Thanks,
Jianshu

Hi @jianshu93!

The formula in MashMap3 should be more accurate, particularly for more divergent sequences. Please refer to the following text from Sextion 5.3 of The minimizer Jaccard estimator is biased and inconsistent

It was previously reported that the Mash formula’s use of a Poisson approximation makes it inaccurate for higher divergence (Ondov et al., 2019; Sarmashghi et al., 2019), so before proceeding further, we modified mashmap to replace this approximation with the exact Binomial-based derivation (we derive the correction formula in Supplementary Appendix A.6).

Hello Bryce,

Thanks. Forget that you mentioned in the paper somewhere. But for ANI calculation in fastANI, especially for below 85% ANI. For the same genome pair (same Jaccard index), new transformation will make identity larger (figure 1, green is the old one, blue is the new one). So in the end ANI will be even diverge from the alignment-based ANI (figure 2, fastANI old VS alignment-based ANI). Also the new estimator saturates faster than the old one toward low identity (in plot 1, y is identity, x is Jaccard index). Very interesting that theoretical new one should be more accurate but in practice it is not the case.

Thanks,

Jianshu

plot1

Rplot14.pdf

Hi @jianshu93

Correct, the new estimator has a higher ANI. It was shown that the the Poisson-based estimator previously used in MashMap would underestimate the ANI, particularly at very high identities.

In terms of the results you shared in the pdf, its hard to know exactly whats causing that without knowing the setup, however here are a few points I ran into when running similar analysis:

  1. The gap-compressed ANI is a better model than typical ANI. Gap-compressed ANI values will be at least as large as the typical ANI values, so that could help offset the bias you see.
  2. Because there is no extension step, the mapping coordinate depends on the seed. For very divergent sequences where there is high kmer dropout, the mapping positions for single segments will often be very noisy. This is not so much of an issue when you have multiple segments that are chained together, but when you have a single segment, the noise will cause noticable differences between the global alignment ANI of the mapped region and the predicted ANI. One fix for this is to make sure that you are (1) using the gap-compressed ANI and (2) using an alignment algorithm which does not penalize gaps on the end of the alignment.

Hope this helps!