dolthub/swiss

[feature request] generic over the hash function

peter7891 opened this issue · 5 comments

If you look at the Rust api for std::collections::HashMap it is generic over the hashing function.
This is very useful for people to be able to choose, based on whether they care more about speed than security in the particular functionality they are using the map in.

I think, this would be a good feature to have here.

Thanks for the request @peter7891!

There are a few things to consider when choosing a hash function for a hash table. Hash functions with a poor output distribution can quickly degrade the performance of the hash table. This is especially true with closed-hashing hash tables and tables that use linear probing to resolve collisions (both apply to SwissMap). Secondly, if the hash function is not properly seeded, it can be susceptible to a hash-flooding attack.

There's a tradeoff between giving users more flexibility/choice and giving them enough rope to hang themselves. An argument against exposing custom hash functions is that the default hash is both fast and high quality. It's the same hash function used by the built in map that leverages AES instructions on supported platforms.

That said, I'm happy to hear other opinions and discuss further

Thanks for the request @peter7891!

There are a few things to consider when choosing a hash function for a hash table. Hash functions with a poor output distribution can quickly degrade the performance of the hash table. This is especially true with closed-hashing hash table and tables that use linear probing to resolve collisions (both apply in this case). Secondly, if the hash function is not properly seeded, it can be susceptible to a hash-flooding attack.

There's a tradeoff between giving users more flexibility/choice and giving them enough rope to hang themselves. An argument against exposing custom hash functions is that the default hash is both fast and high quality. It's the same hash function used by the built in map that leverages AES instructions on supported platforms.

That said, I'm happy to hear other opinions and discuss further

You are probably right. The Go compiler doesn't do many optimizations so its probably futile to try to optimize it.
I might fork and test to see if it will perform better.

I wrote some microbenchmarks to compare a few potential hash functions. The current hash function is dolthub/maphash.

goos: darwin
goarch: amd64
pkg: github.com/dolthub/maphash
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkCompareStringHasher
BenchmarkCompareStringHasher/dolthub/maphash
BenchmarkCompareStringHasher/dolthub/maphash-12         	273337081	         4.314 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompareStringHasher/hash/maphash
BenchmarkCompareStringHasher/hash/maphash-12            	291626719	         4.068 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompareStringHasher/xxHash3
BenchmarkCompareStringHasher/xxHash3-12                 	287016580	         4.138 ns/op	       0 B/op	       0 allocs/op
BenchmarkCompareStringHasher/hash/fnv
BenchmarkCompareStringHasher/hash/fnv-12                	73133010	        16.78 ns/op	       0 B/op	       0 allocs/op
PASS

That looks pretty good.
I wonder, how would aHash perform in Go.
https://github.com/tkaitchuck/aHash/blob/master/compare/readme.md

@andy-wm-arthur Just adding my 2-cents to this. We are using Swiss Map with keys that are already hashed and very evenly distributed. Our keys are [32]byte arrays. It would be great to be able to use them directly instead of hashing them again.