bcgsc/ntHash

Example does not work

marekkokot opened this issue · 3 comments

Hi,

this example does not work:

nthash::NtHash nth("TGACTGATCGAGTCGTACTAG", 1, 5);  // 1 hash per 5-mer
while (nth.roll()) {
    // use nth.hashes() for canonical hashes
    //     nth.get_forward_hash() for forward strand hashes
    //     nth.get_reverse_hash() for reverse strand hashes
}

more precisely this is not working:

 nthash::NtHash nth("TGACTGATCGAGTCGTACTAG", 1, 5);  // 1 hash per 5-mer
    while (nth.roll()) {
        std::cout << nth.hashes()[0] << '\n';
        // use nth.hashes() for canonical hashes
        //     nth.get_forward_hash() for forward strand hashes
        //     nth.get_reverse_hash() for reverse strand hashes
    }

I get just one hash.
I'm not using cmake build, I just get all the sources needed and use

g++ *.cpp btllib/status.cpp

When I change to set the length it works:

nthash::NtHash nth("TGACTGATCGAGTCGTACTAG", 21, 1, 5);  // 1 hash per 5-mer

Also this works:

 std::string s = std::string("TGACTGATCGAGTCGTACTAG");
 nthash::NtHash nth(s, 1, 5);  // 1 hash per 5-mer

but no this:

nthash::NtHash nth(std::string("TGACTGATCGAGTCGTACTAG"), 1, 5);  // 1 hash per 5-mer

(the equivalent of the same as in the original example).

It took me a while to understand this.
Ctor captures const std::string& ref.
C++ guarantees that the temporal value (explicit like in the last code fragment, or implicit like in the original example) will live until the function end. The problem is the delegated ctor (const char*) does not copy the string content and we have a dangling pointer at the end.
Proposed solution:
maybe use memcpy like in the case of BlindNtHash ?
Although maybe there are reasons to not do this.
Anyway would be great to fix the example.

BTW. I had some issues cloning submodules, I have changed .gitmodules to

[submodule "vendor/argparse"]
	path = vendor/argparse
	url = https://github.com/p-ranav/argparse/
[submodule "vendor/btllib"]
	path = vendor/btllib
	url = https://github.com/bcgsc/btllib/

I'm not sure if it does make any difference.

BTW2.
Let's say I have a bit packet sequence, i.e. two bits per symbol in a form of uint64_t array.
Is there a way to use ntHash without unpacking it to std::string/cstring?
It's not crucial and may potentially be even worse vs efficient unpacking, but just out of curiosity.

Hi and thanks for your report and suggestions.

would be great to fix the example.

Agreed, at least until we find a better solution. I don't think using memcpy in NtHash::NtHash would be an efficient thing to do. It's ok for BlindNtHash since the input string is supposed to be just the first k-mer, but NtHash will get a whole sequence, and copying that would take extra time+memory.

BTW. I had some issues cloning submodules, I have changed .gitmodules

I used ssh to initialize the submodules, using HTTPS would be more reliable here. I'm going to remove all the submodules in the next release (soon) anyway to make ntHash easier to install/use.

BTW2.
Let's say I have a bit packet sequence, i.e. two bits per symbol in a form of uint64_t array.
Is there a way to use ntHash without unpacking it to std::string/cstring?
It's not crucial and may potentially be even worse vs efficient unpacking, but just out of curiosity.

This was coincidentally suggested to me in-person very recently. One would have to modify the ntf64 function to replace the SEED_TABLE lookup with the desired uint64_t value in the current implementation. I'll implement a better way to do this for the next release.

Hi,

thanks for the quick responses to all my questions!

I expected that you may want to avoid extra copying.
One suggestion that I think of is to replace const std::string& in ctor with std::string_view

Best
Marek

gf777 commented

this is still not working! It took me a while to realize that too :-)
It produces garbage values