ksahlin/strobemers

An inconsistency between the source code and the manuscript

shenwei356 opened this issue ยท 13 comments

Hi Kristoffer, I found an inconsistency between the source code and the manuscript.

In paper section "Implementation":

... For randstrobes, ... we select the strobe mj in a window that minimizes the function h(m) - h(mj) mod q ...

Source code (order 2 and order 3 )is h(m) + h(mj) mod q.

min_index, min_value = argmin([ (hash_m1 + hash_seq_list[i][1]) % prime for i in range(start, stop)])

I guess it's a typo in the manuscript.

Hi @shenwei356,

Thanks for reporting! You are right I will fix this.

Overall, I think the h(m) - h(mj) mod q is slightly preferred due to the asymmetry of the function. However, there is a very important practical detail here. If h(m) - h(mj) mod qis implemented it needs to be coupled with throwing an error if the user specifies w_min = 0. The reason for this is that a 0 window offset will result in h(m_j) - h(mj) = 0 which means that the first strobe will always be chosen and completely destroys the randomness.

I will integrate to the library an assert w_min > 0 for all the interface function calling minstrobes, hybridstrobes and randstrobes and recommend this to other implementations too.

I will integrate to the library an assert w_min > 0 for all the interface function calling minstrobes, hybridstrobes and randstrobes and recommend this to other implementations too.

Thanks for the explanation. Yes, I also think w_min should be > 0

Hi @ksahlin , does the q has to be a prime number?

If q = pow(2,n) - 1, we could replace

(h(m) + h(mj)) mod q

with

 (h(m) + h(mj)) & q

The bitwise AND operator is much faster than the MOD operator. In my test, It could double the speed.

BTW, I think h(m) + h(mj) mod q is better than h(m) - h(mj) mod q.

h(m) - h(mj) could be zero when

  • m == mj, same position
  • seq(m) == seq(mj)`, same mer, may be in some low-complexity regions.
  • seq(m) == rev_comp(seq(mj)), could happen for short k/l

Yes, I think you're right. I don't see any issues right now with using h(m) + h(mj) mod q and q = pow(2,n) - 1.

Great that you get such speedup with the bitwise AND! I'm not familiar with Go myself but exciting to see what you will get from this.

Another concern: hash values could be negative:

for n=2: h(m1) - h(m2)
for n=3: h(m1) - h(m2) + 2h(m3) 

Hash values are usually uint64 or uint32 generated by popular hash functions. h(m1) - h(m2) could be negative, which could cause an error when assigning a negative value to uint type.

Another extreme cases: value of h(m1) + h(m2) may overflow maximum value of uint64 or uint32.

How about

for n=2: ( h(m1) + h(m2) ) / 2
for n=3: ( h(m1) + h(m2) + h(m3) ) / 3

Definitely! I used Python's hash function which returns signed integers of 36 bytes (unsure how many bits out of those are used to store the hash, but it's something massive). So for the Python implementation, I believe this was not a concern. But for other languages, this would be very important as you note.

Well actually, the functions you propose would lead to symmetric ones f(m1,m2,m3) = f(m1,m3,m2) = ... (and all permutations). This is of concern in the uniqueness aspect. The reason I chose those functions was that they are asymmetric.

Perhaps one could divide all of them with a unique prime? e.g.

for n=2:  h(m1)/p1 + h(m2)/p2 
for n=3:  h(m1)/p1 + h(m2)/p2 + h(m3)/p3

I see, the asymmetry should always be considerated in choosing m2, m3 and computation of the final hash values.

Perhaps one could divide all of them with a unique prime?

It's reasonable. A simple one:

for n=2:  h(m1) + h(m2)/2
for n=3:  h(m1) + h(m2)/2 + h(m3)/3

Yes, that should do it.

Just revisiting this. Couldn't h(m1) + h(m2)/2 and h(m1) + h(m2)/2 + h(m3)/3 theoretically still overflow? Then
h(m1)/2 + h(m2)/3 and h(m1)/2 + h(m2)/3 + h(m3)/5 should do it (of course it reduces the space of hash values a bit -> more collisions but may now be of big concern.

Yes, you're right. Sorry I forgot to renew it.

I used

h(m1)/2+h(m2)/3
h(m1)/3+h(m2)/4+h(m3)/5

Here are the details. https://github.com/shenwei356/strobemers#differences

Revisiting this hash-function business as I'm writing a strobemer program in C++ (good exercise to finally learn some C++:).

After a closer look at hash functions in C++, I realize it is difficult to design a good hash function (low collisions) that would squeeze strings into uint64. I'm currently using https://github.com/BGI-Qingdao/strobemer_cpptest/blob/master/strobemer.cpp#L12 which in turn was taken from minimap2 code. This hash function produces low values for short strings (i.e., not utilizing the full 64 bits to represent strings). For example, for a strobe of length 10, I only observe hash values less than 1,000,000. This probably will become a problem when storing the final hash of a strobemers using addition (as discussed above). The addition of multiple strobes hash values will still result in a small final hash value, which will cause unnecessary collisions.

I, therefore, switched from

final_hash_val = h(m1)/2+h(m2)/3

to

final_hash_val = (h(m1) << k) ^ h(m2)

I haven't implemented order 3 or higher yet so will tackle that problem when I get there. Not sure the above hash construction is the best way to go about it as I'm just learning C++. Nevertheless, thought I should mention this here in case anyone encounters similar problems.