marbl/Mash

Bootstrapping Mash


Alignment-free methods are commonly criticized for not supporting bootstrapping. Indeed, only a few papers so far have addressed computing support values without an MSA (1, 2). However, I think Mash has the potential to implement true bootstrapping and thereby give more confident estimates. Quoting your 2016 paper:

Because S(A∪B) is a random sample of A∪B, the fraction of elements in S(A∪B) that are shared by both S(A) and S(B) is an unbiased estimate of J(A,B).

If one chose a different random sample, one would get a different but hopefully similar estimate. These bootstrapped distances would lead to bootstrapped distance matrices and, in turn, bootstrapped phylogenies. Given a number of them, one could compute the consensus tree and support values for each branch.
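To make the last step concrete, here is a rough sketch of how support values could be derived from a set of replicate distance matrices. It is only an illustration under assumptions of mine: the function names `upgma_clades` and `clade_support` are hypothetical, the distance matrices are assumed to be dicts keyed by `frozenset({a, b})`, and UPGMA is used purely as a short stand-in for whatever tree-building method (e.g. neighbor joining) one would actually use.

```python
from collections import Counter


def upgma_clades(names, dist):
    """Return the non-trivial clades (frozensets of leaf names) of a UPGMA tree.

    `dist` maps frozenset({a, b}) -> distance for every pair of leaf names.
    UPGMA is only a compact stand-in for a real tree-building method.
    """
    clusters = [frozenset([n]) for n in names]

    def avg(c1, c2):
        # Unweighted average of all leaf-to-leaf distances between two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clades = set()
    # Stop at two clusters: the final merge would only give the trivial root clade.
    while len(clusters) > 2:
        pairs = [
            (avg(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        ]
        _, i, j = min(pairs)  # closest pair; ties broken by index
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        clades.add(merged)
    return clades


def clade_support(names, replicate_dists):
    """Fraction of replicate trees in which each clade appears."""
    counts = Counter()
    for dist in replicate_dists:
        counts.update(upgma_clades(names, dist))
    n = len(replicate_dists)
    return {clade: c / n for clade, c in counts.items()}
```

A clade that shows up in every replicate tree gets support 1.0, while one that appears in half of them gets 0.5. A real implementation would more likely hand the replicate trees to an existing consensus tool rather than counting clades by hand.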

As the sample depends mainly on the hash values, getting a different sample should be as easy as using a different seed value in the hash function (an old attempt of mine). Unfortunately, the seed parameter of MurmurHash does not contribute enough to the hash value to yield a completely new sample. One could instead switch to SipHash, which would not only be slower but would also bring a string of other consequences.
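For illustration, the seed idea could look roughly like the following. This is a minimal sketch and not Mash's actual code: it uses keyed BLAKE2b from Python's `hashlib` as a stand-in for a seeded hash (Mash itself uses MurmurHash), ignores canonical k-mers, and all function names are hypothetical. The distance formula is the Mash distance from the 2016 paper, D = -1/k · ln(2j / (1 + j)); clamping it to 1 is my own choice here.

```python
import hashlib
import math


def kmers(seq, k):
    """All distinct k-mers of a sequence (no canonicalization, for brevity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer, seed):
    """64-bit hash of a k-mer; the seed enters as the BLAKE2b key.

    Keyed BLAKE2b is an assumption standing in for a seeded hash function.
    """
    key = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, key=key).digest()
    return int.from_bytes(digest, "little")


def sketch(seq, k, s, seed):
    """Bottom-s sketch: the s smallest hash values of the sequence's k-mers."""
    return sorted(hash64(km, seed) for km in kmers(seq, k))[:s]


def jaccard_estimate(sa, sb, s):
    """Fraction of S(A u B) that is shared by both S(A) and S(B)."""
    merged = sorted(set(sa) | set(sb))[:s]
    if not merged:
        return 0.0
    shared = set(sa) & set(sb)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance for a Jaccard estimate j and k-mer size k."""
    if j == 0:
        return 1.0
    # max() avoids returning -0.0 for j == 1; min() clamps to 1 (my choice).
    return min(1.0, max(0.0, -math.log(2 * j / (1 + j)) / k))


def replicate_distances(seq_a, seq_b, k, s, seeds):
    """One distance per seed; each seed induces a different random sample."""
    out = []
    for seed in seeds:
        sa = sketch(seq_a, k, s, seed)
        sb = sketch(seq_b, k, s, seed)
        out.append(mash_distance(jaccard_estimate(sa, sb, s), k))
    return out
```

Calling `replicate_distances(a, b, k=21, s=1000, seeds=range(100))` would then give 100 replicate distances for one pair of sequences; doing this for all pairs yields one distance matrix per seed.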

So, yeah; I think bootstrapping would be a cool feature that could provide a big benefit (support values) to resulting phylogenies.