- MUM hash is a fast non-cryptographic hash function suitable for different hash table implementations
- MUM means MUltiply and Mix
- It is a name of the base transformation on which hashing is implemented
- Modern processors have a fast logic to do long number multiplications
- It is very attractable to use it for fast hashing
- For example, 64x64-bit multiplication can do the same work as 32 shifts and additions
- I'd like to call it Multiply and Reduce. Unfortunately, MUR (MUltiply and Rotate) is already taken for famous hashing technique designed by Austin Appleby
- I've chosen the name also as I am releasing it on Mother's day
- MUM hash passes all SMHasher tests
- For comparison, only 4 out of 15 non-cryptographic hash functions in SMHasher passes the tests, e.g. well known FNV, Murmur2, Lookup, and Superfast hashes fail the tests
- MUM algorithm is simpler than City64 and Spooky ones
- MUM is specifically designed for 64-bit CPUs (Sorry, I did not want to
spend my time on dying architectures)
- Still MUM will work for 32-bit CPUs and it will be sometimes faster Spooky and City
- On x86-64 MUM hash is faster than City64 and Spooky on all tests except for one
test for the bulky speed
- Starting with 240-byte strings, City uses Intel SSE4.2 crc32 instruction
- I could use the same instruction but I don't want to complicate the algorithm
- In typical scenario, such long strings are rare. Usually another
interface (see
mum_hash_step
) is used for hashing big data structures
- MUM has a fast startup. It is particular good to hash small keys which are a majority of hash table applications
- Input 64-bit data are randomized by 64x64->128 bit multiplication and mixing
high- and low-parts of the multiplication result by using an addition.
The result is mixed with the current state by using XOR
- Instead of the addition for mixing high- and low- parts, XOR could be
used
- Using the addition instead of XOR improves performance by about 10% on Haswell and Power7
- Instead of the addition for mixing high- and low- parts, XOR could be
used
- Prime numbers randomly generated with the equal probability of their bit values are used for the multiplication
- When all primes are used once, the state is randomized and the same prime numbers are used again for subsequent data randomization
- Major loop is transformed to be unrolled by compiler to benefit from the compiler instruction scheduling optimization and OOO instruction execution in modern CPUs
- x86-64 AVX2 insn
MULX
is used on processors where it is implemented- Using
MULX
increases the bulk speed upto 25% - Although on modern Intel processors
MULQ
takes 3-cylces vs. 4 forMULX
,MULX
permits more freedom in insn scheduling as it uses less fixed registers. That is a reason for the improvement - To make the code portable, GCC function multiversioning is used. The best code is chosen depending on the used processor features
- Using
- AARCH64 128-bit result multiplication is very slow as it is
implemented by a GCC library function
- To use only 2 insns for such multiplication one GCC asm extension was added
- Here are the results of benchmarking MUM and the fastest
non-cryptographic hash functions I know:
- Google City64 (sources are taken from SMHasher)
- Bob Jenkins Spooky (sources are taken from SMHasher)
- Yann Collet's xxHash64 (sources are taken from the original repository)
- Murmur hash functions are slower so I don't compare it here
- I also added J. Aumasson and D. Bernstein's SipHash24 for the comparison as it is a popular choice for hash table implementation these days
- Update:
- A metro hash
was added as people asked and as metro hash is
claimed to be the fastest hash function
- metro hash is not portable as others functions as it does not deal with unaligned accesses problem on some targets
- metro hash will produce different hash for LE/BE targets
- some people on hackernews pointed out that the algorithm is very close to xxHash one but still it is much faster xxHash
- A metro hash
was added as people asked and as metro hash is
claimed to be the fastest hash function
- Measurements were done on 3 different architecture machines:
- 4.2 GHz Intel i7-4790K
- 3.55 GHz Power7
- APM X-Gene Mustang Board
- Each test was run 3 times and the minimal time was taken
- GCC-4.8 was used for PPC and AARCH64 machines and GCC-4.9 was used for Intel one
-O3
was used for all compilations- The strings were generated by
rand
calls - The strings were aligned to see a hashing speed better
- No constant propagation for string length is forced. Otherwise, the results for MUM hash would be even better
- Update: Some people complaint that my comparison is unfair
as most hash functions are not inlined
- I believe that the interface is the part of the implementation. So when the interface does not provide an easy way for inlining, it is an implementation pitfall
- Still to address the complaints I added
-flto
for benchmarking all hash functions excluding MUM. This option makes cross-file inlining - xxHash64 results became worse for small strings and better for the bulk speed test
- All results for other functions improved, sometimes quite a lot
MUM | City64 | Spooky | xxHash64 | Metro64 | SipHash24 | |
---|---|---|---|---|---|---|
5 bytes (1,280M strings) | 7.74s | 10.80s | 9.22s | 9.57s | 8.23s | 21.56s |
8 bytes (1,280M strings) | 5.52s | 10.80s | 7.42s | 9.20s | 4.88s | 26.67s |
16 bytes (1,280M strings) | 7.45s | 9.42s | 18.29s | 11.05s | 7.32s | 25.89s |
32 bytes (1,280M strings) | 8.11s | 10.84s | 17.73s | 16.47s | 13.51s | 40.43s |
64 bytes (1,280M strings) | 11.51s | 12.01s | 23.24s | 18.03s | 14.55s | 61.75s |
128 bytes (1,280M strings) | 17.07s | 16.55s | 41.26s | 21.14s | 16.90s | 104.97s |
1KB (100M strings) | 6.61s | 6.53s | 8.31s | 6.67s | 6.18s | 53.86s |
MUM | City64 | Spooky | xxHash64 | Metro64 | SipHash24 | |
---|---|---|---|---|---|---|
5 bytes (1,280M strings) | 17.06s | 26.31s | 21.44s | 24.63s | 23.21s | 41.33s |
8 bytes (1,280M strings) | 8.64s | 26.34s | 14.41s | 20.16s | 10.44s | 51.19s |
16 bytes (1,280M strings) | 17.53s | 22.68s | 22.68s | 29.75s | 15.87s | 61.50s |
32 bytes (1,280M strings) | 18.63s | 25.56s | 29.76s | 37.07s | 26.64s | 79.15s |
64 bytes (1,280M strings) | 24.26s | 27.72s | 49.15s | 39.53s | 28.44s | 118.20s |
128 bytes (1,280M strings) | 32.82s | 38.16s | 78.92s | 47.45s | 31.82s | 193.42s |
1KB (100M strings) | 9.25s | 15.24s | 14.15s | 27.16s | 11.00s | 96.38s |
MUM | City64 | Spooky | xxHash64 | Metro64 | SipHash24 | |
---|---|---|---|---|---|---|
5 bytes (1,280M strings) | 22.40s | 27.21s | 21.34s | 27.21s | 19.73s | 54.41s |
8 bytes (1,280M strings) | 15.47s | 27.21s | 17.60s | 25.07s | 13.87s | 67.22s |
16 bytes (1,280M strings) | 22.40s | 24.54s | 32.54s | 48.55s | 19.20s | 83.23s |
32 bytes (1,280M strings) | 30.41s | 28.81s | 33.07s | 61.65s | 36.27s | 118.97s |
64 bytes (1,280M strings) | 47.48s | 32.01s | 56.02s | 76.82s | 37.88s | 186.73s |
128 bytes (1,280M strings) | 80.56s | 45.88s | 91.23s | 107.23s | 42.14s | 331.94s |
1KB (100M strings) | 38.39s | 24.51s | 22.55s | 44.85s | 29.59s | 170.01s |
- A major loop in function
_mum_hash_aligned
could be vectorized using vector multiplication, addition, and shuffle instructions - Unfortunately, x86-64 CPUs currently does not have vector
multiplication
64 x 64-bit -> 128-bit
- AVX2 CPUs only have vector multiplication
32 x 32-bit -> 64-bit
- One such vector instruction makes 4 multiplication which is
roughly equivalent what one
MULQ/MULX
insn does but it has bigger latency time thanMULQ/MULX
- Therefore all my vectorized code I tried was slower
- One such vector instruction makes 4 multiplication which is
roughly equivalent what one
- If Intel introduces a new vector insn for
64 x 64-bit -> 128-bit
multiplication, potentially it could increase MUM speed up to 2 times (may be less as it requires to provide aligned data)
- People worrying about denial attacks based on generating hash collisions started to use cryptographic hash functions in hash tables
- Cryptographic functions are very slow
- sha1 is about 20-30 slower than MUM and City on the bulk speed tests
- The new fastest cryptographic hash function SipHash is up to 10 times slower
- MUM is also resistant to preimage attack (finding a string with given hash)
- To make hard moving to previous state values we use mostly 1-to-1 one way
function
lo(x*C) + hi(x*C)
where C is a constant. Brute force solution of equationf(x) = a
probably requires2^63
tries. Another used function equationx ^ y = a
has a2^64
solutions. It complicates finding the overal solution further
- To make hard moving to previous state values we use mostly 1-to-1 one way
function
- If somebody is not convinced, you can use randomly chosen
multiplication constants (see function
mum_hash_randomize
). Finding a string with a given hash even if you know a string with such hash probably will be close to finding two or more solutions of Diophantine equations - If somebody is still not convinced, you can implement hash tables to recognize the attack and rebuild the table using MUM function with the new multiplication constants
- Analogous approach can be used if you use weak hash function as MurMur or City. Instead of using cryptographic hash functions all the time, hash tables can be implemented to recognize the attack and rebuild the table and start using a cryptographic hash function
- This approach solves the speed problem and permits to switch easily to a new cryptographic hash function if a flaw is found in the old one, e.g. switching from SipHash to SHA2
- Please just include file
mum.h
into your C/C++ program and use the following functions:- optional
mum_hash_randomize
for choosing multiplication constants randomly mum_hash_init
,mum_hash_step
, andmum_hash_finish
for hashing complex data structuresmum_hash64
for hashing a 64-bit datamum_hash
for hashing any continuous block of data
- optional
- To compare MUM speed with Spooky, City64, and SipHash24 on your machine go to
the directory
src
and run a script
sh bench
- The script will compile source files and run the tests printing the results
- MUM is not designed to be a crypto-hash
- The key (seed) and state are only 64-bit which are not crypto-level ones
- The result can be different for different targets (BE/LE
machines, 32- and 64-bit machines) as for other hash functions, e.g. City (hash can be
different on SSE4.2 nad non SSE4.2 targets) or Spooky (BE/LE machines)
- If you need the same MUM hash independent on the target, please
define macro
MUM_TARGET_INDEPENDENT_HASH
- If you need the same MUM hash independent on the target, please
define macro
- There is a variant of MUM called MUM512 which can be a candidate
for a crypto-hash function and keyed crypto-hash function and
might be interesting for researchers
- The key is 256-bit
- The state and the output are 512-bit
- The block size is 512-bit
- It uses 128x128->256-bit multiplication which is analogous to about 64 shifts and additions for 128-bit block word instead of 80 rounds of shifts, additions, logical operations for 512-bit block in sha2-512.
- It is only a candidate for a crypto hash function
- I did not make any differential crypto-analysis or investigated
probabilities of different attacks on the hash function (sorry, it
is too big job)
- I might be do this in the future as I am interesting in differential characteristics of the MUM512 base transformation step (128x128-bit multiplications with addition of high and low 128-bit parts)
- I am interesting also in the right choice of the multiplication constants
- May be somebody will do the analysis. I will be glad to hear anything. Who knows, may be it can be easily broken as Nimbus cipher.
- The current code might be also vulnerable to timing attack on systems with varying multiplication instruction latency time. There is no code for now to prevent it
- I did not make any differential crypto-analysis or investigated
probabilities of different attacks on the hash function (sorry, it
is too big job)
- To compare the MUM512 speed with the speed of SHA-2 (SHA512) and
SHA-3 (SHA3-512) go to the directory
src
and run a scriptsh bench-crypto
- SHA-2 and SHA-3 code is taken from RHash
- Update: Blake2 crypto-hash from github.com/BLAKE2/BLAKE2 was added for comparison. I use sse version of 64-bit Blake2 (blake2b).
- Here is the speed of the crypto hash functions on 4.2 GHz Intel i7-4790K:
MUM512 | SHA2 | SHA3 | Blake2B | |
---|---|---|---|---|
10 bytes (20 M texts) | 0.73s | 0.65s | 1.13s | 0.80s |
100 bytes (20 M texts) | 1.08s | 0.65s | 2.21s | 0.81s |
1000 bytes (20 M texts) | 3.90s | 4.82s | 15.3s | 3.31s |
10000 bytes (5 M texts) | 7.96s | 11.74s | 37.3s | 7.27s |
- Files
mum-prgn.h
andmum512-prng.h
provides pseudo-random functions based on MUM and MUM512 hash functions - All PRNGs passed NIST Statistical Test Suite for Random and
Pseudorandom Number Generators for Cryptographic Applications
(version 2.2.1) with 1000 bitstreams each containing 1M bits
- Although MUM PRNG pass the test, it is not a cryptographically secure PRNG as the hash function used for it
- To compare the PRNG speeds go to
the directory
src
and run a scriptsh bench-prng
- For the comparison I wrote crypto-secured Blum Blum Shub PRNG
(file
bbs-prng.h
) and PRNGs based on fast cryto-level hash functions in ChaCha stream cipher (filechacha-prng.h
) and SipHash24 (filesip24-prng.h
).- The additional PRNGs also pass the Statistical Test Suite
- For the comparison I also added the fastest PRNG xoroshiro128+ (it is not a crypto level PRNG)
- Update: I had no intention to tune MUM based PRNG first but
after adding xoroshiro128+ and finding how fast it is, I've decided
to speedup MUM PRNG
- I added code to calculate a few PRNs at once to calculate them in parallel
- I added AVX2 version functions to use faster
MULX
instruction - The new version also passed NIST Statistical Test Suite. It was tested even on bigger data (10K bitstreams each containing 10M bits). The test took several days on i7-4790K
- The new version is almost 2 times faster the old one and MUM PRN
speed became almost the same as xoroshiro128+ one
- xoroshiro128+ and MUM PRNG functions are inlined in the benchmark program
- both code without inlining will be visibly slower and the speed difference will be negligible as one PRN calculation takes only about 3.5 machine cycle for xoroshiro128+ and MUM PRN.
- Here is the speed of the PRNGs in millions generated PRNs per second on 4.2 GHz Intel i7-4790K:
M prns/sec | |
---|---|
BBS | 0.057 |
ChaCha | 106 |
SipHash24 | 383 |
MUM512 | 67 |
MUM | 1116 |
XOROSHIRO128+ | 1145 |
GLIBC RAND | 169 |