A minimal perfect hash function library based on CMPH.
Supports the following algorithms:
Other features:
- Pack/unpack each hash function to a
Span<byte>
- Logging is supported via Microsoft.Extensions.Logging
ChdBuilder<string> builder = new ChdBuilder<string>(NullLogger<ChdBuilder<string>>.Instance);
string[] data =
[
"elephant",
"goat",
"horse",
"cow"
];
if (!builder.TryCreate(data, out var state))
{
Console.WriteLine("Unable to create perfect hash function");
return;
}
foreach (string item in data)
{
Console.WriteLine($"Hashcode for {item}: {state.Search(item)}");
}
Output:
Hashcode for elephant: 9
Hashcode for goat: 10
Hashcode for horse: 6
Hashcode for cow: 1
Before diving into perfect hash functions, let me explain the challenges with a normal hash function. A normal hash function takes in, for example, a string and outputs an integer in the range [0, 2^32-1].
Let's say hash("goat")
gives us 4197513
In order to use that hash function in a hash table/set, we need to modulo the hash output with the number of items in the table/set.
var items = ["elephant", "goat", "horse", "cow"]
If we hash each of them and modulo with 4, we get the following values:
hash("elephant") % 4 = 1
hash("goat") % 4 = 0
hash("horse") % 4 = 2
hash("cow") % 4 = 1
As can be seen, both "elephant" and "cow" gets the same index. That is what we call a hash collision. In a hash table/set this has to be addressed, usually done via chaining or open addressing.
A Perfect Hash is a hash function that maps a set of n
keys to n
unique integers with no collisions. Therefore there is no need for collision resolution.
A Minimal Perfect Hash is a perfect hash function that has the added benefit of hashing to a range of [0, n-1].
There are usually "holes" in the output of a perfect hash:
PH("elephant") = 2
PH("goat") = 1
PH("horse") = 5
PH("cow") = 6
There are no holes in a minimal perfect hash:
MPH("elephant") = 3
MPH("goat") = 1
MPH("horse") = 0
MPH("cow") = 2
All:
- Moving large allocations out of loops
- Lazy loading lookup tables to reduce memory usage
- Some implementations had their number of iterations hardcoded. I've made them configurable.
- Some implementations used modulus to reduce the keyspace of the seed, but the hash function don't care, so I've removed the reduction.
BDZ:
- It did 100 iterations with the same 16 hash functions. It now does
n
iterations with random hash functions.
BMZ:
- Use 2 seeds instead of 3. The third seed was never used.
This library implements several PH/MPH functions intended to be used for mapping a value to an integer. Its primary use case is for mapping values in hash tables/sets.
It only benefits situations when:
- Data is completely static
- Your dataset is too big for other perfect hash functions
- You are using a mapping table and want to reduce memory usage
Benchmarks are sorted from fastest to slowest.
Dict
is the .NET Dictionary implementation._M
means it is the minimal variant of the hash function.
| Method | name | Mean | Allocated |
|---------- |------ |------------------:|----------:|
| Query | Dict | 8.708 ns | - |
| Query | BMZ_M | 18.199 ns | - |
| Query | BDZ | 19.355 ns | - |
| Query | CHM_M | 19.770 ns | - |
| Query | FCH_M | 19.973 ns | - |
| Query | CHD | 30.335 ns | - |
| Query | BDZ_M | 31.963 ns | - |
| Query | CHD_M | 44.504 ns | - |
| Construct | Dict | 8,638.064 ns | 31016 B |
| Construct | CHD | 49,194.324 ns | 39156 B |
| Construct | CHD_M | 67,581.087 ns | 47831 B |
| Construct | BDZ_M | 159,898.128 ns | 80575 B |
| Construct | BDZ | 164,998.722 ns | 250414 B |
| Construct | BDZ_M | 165,906.307 ns | 250334 B |
| Construct | CHM_M | 243,532.397 ns | 83903 B |
| Construct | FCH_M | 21,347,443.750 ns | 1321044 B |