Callidon/bloom-filters

Question about implementation of double hashing

SimonWoolf opened this issue · 7 comments

Hi. I'm just looking into the hashing implementation here and I'm baffled by several aspects of the implementation, hoping you can tell me if there's anything I'm missing.

For example, in getIndices, you're rehashing the input on every iteration of the loop:

bloom-filters/src/utils.ts

Lines 206 to 209 in 5c81fa4

for (let i = 1; i <= hashCount; i++) {
const hashes = hashTwice(element, true, seed + size % i)
arr.push(doubleHashing(i, hashes.first, hashes.second, size))
}

Surely this defeats the point of double hashing, to simulate k independent hash functions given only two real ones? Not using double hashing at all would need to do one hash per index, your implementation does two hashes per index.

It's true that the hashes you're calculating on each loop aren't quite the same, because you're adding size % i -- that's the number of cells modulo the loop iteration -- to the seed each time. But why? That seems like a really strange thing to add to the seed. It doesn't guarantee that the seed is different on each loop (eg if the number of cells is even it'll be 0 for the first 2 iterations). But again that's not something you should want/need anyway in double hashing.

Am I missing something?

Does your question is more about the randomness property of having "random, uniform, and independent hash functions h1 and h2" before computing the double hashing? Or more focused on the fact we're rehashing hashCount times to get our indexes?

Or more focused on the fact we're rehashing hashCount times to get our indexes?

This. (2*hashCount, actually).

My understanding is that double-hashing lets you simulate k independent hash functions gᵢ(x) (1 ≤ i ≤ k) of some input x (to an output set of size m) with only two actual hash calculations, by taking gᵢ(x) = h₁ + i h₂ mod m (maybe plus some quadratic or cubic term as well). For example, you could simulate 10 independent hashes with only two actual hash calculations instead of ten.

But in your implementation, you're simulating 10 independent hashes with 20 actual hash calculations. You're doing twice the amount of work than even an implementation that didn't use double hashing at all and just calculated one hash per index would be doing?

I did an implementation (as a fork of bloomit which is a fork of bloom-filters) that just used two hash calculations and used double hashing to simulate all k, and it seems to work as expected: https://github.com/ably-forks/bloomit/blob/ae99007121121b28fbaab54f699658de5b249c4a/src/utils.ts#L113-L125 . But would be happy to be convinced I've missed something 🙂

Yes you're right! We should only hash once for getting our indexes.
Thank you for your question it seems that something went wrong and it has to be fixed 👍

In the commit I referenced I fixed how indexes are generated and replace getDistinctIndexes with getIndexes in most of the structures.
Usage of getIndexes is preferred, except for IBLTs where distinct indexes need to be generated.

The point is Bloom Filters or equivalent don't need to set data in distinct indexes.
In practice, distinct indexes are better yes but the structures are designed to deal with collision with certain probability.

In the different implementations we provide, only IBLTs require a set of distinct indexes.
It appears that generating those multiple distinct indexes in a range of [0, size) will be very slow due to modulo collisions. In some cases it can take a lot of time. That is why we hash multiples times the value by tweaking the seed. It has the effect of generating more CPU consumption but decrease drastically the number of iterations needed for getting all the desired indexes.

Just take a look at the new tests in

it('hash on each loop instead of once for getting distinct indices', () => {
function getDistinctIndexesWithOneHash(
element,
size,
number,
seed
) {
if (seed === undefined) {
seed = utils.getDefaultSeed()
}
let n = 0
const indexes = new Set()
const hashes = utils.hashTwice(element, true, seed)
while (indexes.size < number) {
const ind = utils.doubleHashing(
n,
hashes.first % size,
(hashes.second % size) + 1,
size
)
if (!indexes.has(ind)) {
indexes.add(ind)
}
n++
if (n % 10000000 === 0) {
console.log(`Only ${indexes.size} generated after ${n} loop. Stopping`)
throw Error('it must not take so much time to get our 7 indices !!')
}
}
return [...indexes.values()]
}
const elem = '44'
const hashcount = 7
const size = 9586
let start = new Date().getTime()
const indexes_a = utils.getDistinctIndexes(elem, 7, hashcount, size, seed)
console.log(`${indexes_a.length} distinct indices generated in ${new Date().getTime() - start} ms`)
start = new Date().getTime()
assert.throws(() => getDistinctIndexesWithOneHash(elem, 7, hashcount, size, seed), Error, 'it must raise because computing once the hashes is not efficient at all.')
})

Even after 1M iterations we just generated 1 index out of the 7 desired. With a hash per iteration we make much less iterations for generating the 7 indexes. Your example with even numbers is exactly one of the problems we avoid with this tweak.

BTW, for performance I suggest you to modify your implementation with the getIndexes function until a merge into the master.

bloom-filters/src/utils.ts

Lines 224 to 253 in 6e9cbcd

/**
* Generate N indexes on range [0, size)
* It uses the double hashing technique to generate the indexes.
* It hash twice the value only once before generating the indexes.
* Warning: you can have a lot of modulo collisions.
* @param element - The element to hash
* @param size - The range on which we can generate the index, exclusive
* @param hashCount - The number of indexes we want
* @return An array of indexes on range [0, size)
*/
export function getIndexes(
element: HashableInput,
size: number,
hashCount: number,
seed?: number
): Array<number> {
if (seed === undefined) {
seed = getDefaultSeed()
}
const arr = []
const hashes = hashTwice(element, true, seed)
for (let i = 0; i < hashCount; i++) {
arr.push(
doubleHashing(i, hashes.first % size, (hashes.second % size) + 1, size)
)
}
if (arr.length !== hashCount)
throw new Error('report this, please, shouldnt be of different size')
return arr
}

Can I ask why you're using farmhash instead of xxhashjs? I did not find any interesting benchmark.

Even after 1M iterations we just generated 1 index out of the 7 desired. With a hash per iteration we make much less iterations for generating the 7 indexes.

It's not a matter of "less iterations". You're getting into a cycle where h₂(x) == 0 mod 7, you could do 10^100 iterations and it would still never generate an index other than 4. (In theory anyway - in practice n can get high enough that n*h₂ starts approaching Number.MAX_SAFE_INTEGER and you start getting aliasing issues with putting integers in doubles, but that's hardly a good solution). The reason you're getting into a cycle is that you're adding 1 to the second hash (4098376680), which gives 4098376681, which is divisible by 7. This breaks the requirement for double hashing that h₂(x) must be coprime with the size. You can solve this with eg the enhanced double hashing algorithm described in the dillinger paper. (Which I've just realised I implemented wrongly in mine, I'll fix that).

That is why we hash multiples times the value by tweaking the seed. It has the effect of generating more CPU consumption but decrease drastically the number of iterations needed for getting all the desired indexes.

Again, if you're doing to rehash on every iteration there's literally no point at all in doing double hashing. The only point of double hashing in this context is to avoid computing a new hash for every index, but you're doing that anyway.

Can I ask why you're using farmhash instead of xxhashjs?

Uninteresting reasons (we're already using farmhash for other things, so reusing the dependency).

Tweaked my double hashing implementation: https://github.com/ably-forks/bloomit/blob/9cacea1cd1637b60be486dd15e8703b48b160aaa/src/utils.ts#L106-L141 (that's on the xxhash version, for a pr to bloomit). Properly implements enhanced double hashing, rehashes at most once every size iterations (which should be almost never, only on pathological inputs).

Ahah 👍 we worked exactly on the same thing but in different ways, and I'm actually surprised by yours. I agree, this will almost never happen because the number of hash functions is not very high in practice. I mean, I never see someone setting a hashCount of 1000. But just in case it will work.
Do you accept a merge of your work here? (see last commit, authorship added)

Do you accept a merge of your work here? (see last commit, authorship added)

Sure, go for it