bcgsc/ntHash

ntHash is not matching the article description?

luizirber opened this issue · 19 comments

Hi!

Last week I started implementing a Rust version of ntHash, and I can't make it match the output of ntHash (in this repo). I ended up following this answer from @karel-brinda on bfx stack overflow to implement a Python version too, and still couldn't match the output.

Then I checked the supplementary materials, and based on the NT64 function defined there implemented NTF64, NTR64 and NTC64... and couldn't match the output. Either I'm missing something obvious, or there is a bug in the optimized version in this repo...

I prepared a repo with a Jupyter notebook analysing some test cases, and you can Binder to have an interactive version to follow. I also added an example of using autocheck and catch to write an oracle test for an optimized and an unoptimized version of ntHash, I can open a PR to add it to this repo and set up Travis to run it.

Hi @luizirber, in the latest version pushed 6 days ago, we have changed the way we're computing the base hash value to have periodicity greater than 64. I think this is why you're not getting the same hash results as what we had in our older versions. Could you please let me know which commit you have used in your experiments?

commit 57af16a, the one from 6 days ago

Indeed, using ee4e9c3 works!

I will stick with 1.0.4 as a reference. But shouldn't it all keep working as before on master?

That's why you're getting different results. You're using the latest version where we're using a new approach for computing the hash values. Since in the past versions we had the issue of periodicity of 64 which is extended to 1023 using our new idea. Are you familiar with the periodicity problem for rolling hash?

@luizirber In future versions I'll make the periodicity to 65K or even 4Billion. This makes the computation much more accurate. Which implementation of rolling hash you're using in khhmer now?

But even with a 2-mer the hashes will be different? Are you planning to make a new version that is backward-incompatible?

I was trying it out for sourmash, but @standage was investigating ntHash to use it in khmer

@luizirber sorry, I didn't get your point. you mean the new version hash value vs older version are not the same? or there is incompatibility between hash values in the new version?

the new version hash (master) is different than the older version (1.0.4, for example).

It should be different, because the base hash function is different now. Let me give detailed explanation.

In the old version the hash value for a given k-mer s was computed using:

hVal = rol(s[0],k-1) ^ rol(s[1], k-2) ^ ... ^ s[k-1]

As an example suppose k = 65, then:

hval = rol(s[0],0) ^ rol(s[1],63) ^ rol(s[2],62) ^ ... ^ rol(s[64], 0)

The bold part is the periodicity issue which results in more collisions when k>64 since for positions where k%64 are the same we are cancelling out the corresponding bases if they are the same or we have the same hash values for palindromic combination, i.e.:

roll_hash(ACGGTGGCTCATGCCTGTAATCCTAGCACTTTGGGAAGCTGAGGTGAGTGGACTGCCTGAGCTCC)
=
roll_hash(CCGGTGGCTCATGCCTGTAATCCTAGCACTTTGGGAAGCTGAGGTGAGTGGACTGCCTGAGCTCA)

This will increase the collision rate. we call this the periodicity issue and when k > 64 we see a slight increase in false positive rates.

Now, in the new version we addressed this by changing the definition of rol and ror to new functions.

@luizirber In you current khmer check to see if you have this issue which I think it should be the case for khmer as well since you're using https://github.com/lemire/rollinghashcpp

I see, thanks!

Do you plan to name the new scheme ntHash2, or something like this? It will be very confusing if both are ntHash...

@luizirber As I thought this is an issue if you're using rolling hash, rollinghashcpp, in khmer:

#include "cyclichash.h"

using namespace std;

int main(int argc, char** argv) {
    std::string seq1 = "AAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAG";
    std::string seq2 = "GAGTGTCAAACATTCAGACAACAGCAGGGGTGCTCTGGAATCCTATGTGAGGAACAAACATTCAA";
    const uint n(65);
    const uint L(64);
    CyclicHash<uint64> hf(n,L );
    cerr << hf.hash(seq1) << endl;
    hf.reset();
    cerr << hf.hash(seq2) << endl;
    return 0;
}

Output:

[hmohamadi@hpcg01 rollinghashcpp]$ g++ -o rolltest rolltest.cpp 
[hmohamadi@hpcg01 rollinghashcpp]$ ./rolltest 
15732650108616394752
15732650108616394752
[hmohamadi@hpcg01 rollinghashcpp]$

@luizirber Is it better to name it ntHash2? or the new commit would be OK if people in the community want to use it?

Speaking from experience, people will be confused if it has the same name but you can get different hash values depending on what day you downloaded it =]

For example, with sourmash I'm building sketches for public genomic databases, and I can't compare them if the hashes are not generated in the same way. This might not be an issue for people that discard the hashed data after analysis, but for me it's essential.

@luizirber sure let me update README and give detailed explanation why the rolling hash functions have the problem of periodicity and how we can resolve it and have a new version.

In sourmash, do you have an idea about the max sequence length you're going to hash?

Hi @luizirber, I just released the ntHash version 2 highlighting in the release note the difference between outputs.

For sourmash we don't see going further than 64-mers (more detailed explanation), so ntHash 1 would be fine.