Candidate replacements for use with boost::unordered_flat_map
from
Boost 1.81, instead of the default boost::hash<std::string>
. Might
be added as a dedicated boost::string_hash
function object for
string hashing only, or directly as the implementation underneath
boost::hash<std::string>
et al.
As a basic primitive, the functions use mulx
, a multiplication of
two 64 bit numbers yielding a 128 bit result, then combining its two
64 parts with bitwise-xor. mulx
has very good combiner properties
for its cost (a single mul
instruction under x86), and has been
popularized by Vladimir Makarov as
the basis of his hash function MUM hash.
mulxp0_hash
is closest to boost::hash<std::string>
from Boost 1.81,
and uses the same basic state = combine( state, input );
round, with
combine(h, v)
being h ^ mulx( h + 1 + v, k )
, where k
is a constant.
Unlike boost::hash
, it adds a leading round of scrambling the initial
seed, and a trailing round of scrambling the final result. (These are
necessary to pass the
SMHasher battery of tests.)
Since this construction makes the next round dependent on the current one,
the loop is latency-bound. The mul
instruction has higher throughput than
latency, which can be exploited by changing the round to
state = combine( state, mix( input ) );
, with the combine
function
cheaper than the mix
one, a construct popularized by Austin Appleby's
MurmurHash.
mulxp1_hash
follows this approach, with a round of
h ^= mulx( v1 + w, k );
, where k
is, as before, a constant, and w
is a Weyl sequence (a
fancy way of saying that w
is updated on each iteration by adding a
constant to it, w += q;
.)
On sufficiently long inputs, where the parts outside the loop don't matter,
mulxp1_hash
is about 2.5 times as fast as mulxp0_hash
. (However, long
inputs don't really matter for our primary target use case, a hash function
for boost::unordered_flat_map<std::string, X>
, as most keys are small.)
mulxp2_hash
is a slight variation of mulxp1_hash
, using the same round,
which basically unrolls the main loop twice, consuming 16 bytes at a time.
This makes it slightly (10%) faster on long inputs, but slower on short
inputs.
The final member of the mulxp
family is mulxp3_hash
. It has been inspired
by the UMAC/NH family of functions
started by FARSH, continued with
MUM and perfected with
wyhash. These consume 16 bytes
(two words) of input at a time, v1
and v2
, and feed both to the mulx
primitive, as in mulx(v1 + k1, v2 + k2)
, where k1
and k2
are suitably
chosen constants that ideally would come from a secret one-time pad.
The advantage of this scheme is that the inner loop performs one fewer
multiplication per round, which ideally would double the speed. The
disadvantage is that if one of the operands to the mulx
primitive happens
to be zero, the other input word is essentially discarded and does not affect
the final hash value.
In mulxp3_hash
, the round operation is h ^= mulx( v1 + w, v2 + w + k );
,
where w
is a Weyl sequence as above, and k
is a constant. This makes
mulxp3_hash
approximately 1.5 times as fast as mulxp2_hash
.
Results from running rurban/smhasher:
Verification value 0xBE132908 ....... PASS
Average - 1.433 bytes/cycle - 4100.53 MiB/sec @ 3 ghz
Average 28.318 cycles/hash
Verification value is 0x00000001 - Testing took 1777.521000 seconds
No test failures.
Verification value 0xA476BA89 ....... PASS
Average - 3.734 bytes/cycle - 10682.90 MiB/sec @ 3 ghz
Average 23.890 cycles/hash
Verification value is 0x00000001 - Testing took 1664.050000 seconds
No test failures.
Verification value 0x85C9262A ....... PASS
Average - 4.061 bytes/cycle - 11619.43 MiB/sec @ 3 ghz
Average 26.145 cycles/hash
Verification value is 0x00000001 - Testing took 1667.155000 seconds
No test failures.
Verification value 0xB4786096 ....... PASS
Average - 5.921 bytes/cycle - 16941.48 MiB/sec @ 3 ghz
Average 21.886 cycles/hash
Verification value is 0x00000001 - Testing took 1641.350000 seconds
No test failures.
Verification value 0x8766204A ....... PASS
Average - 0.377 bytes/cycle - 1077.76 MiB/sec @ 3 ghz
Average 58.984 cycles/hash
Verification value is 0x00000001 - Testing took 1976.459000 seconds
Failures: PerlinNoise
Testing 16777216 coordinates (L2) :
Testing collisions ( 64-bit) - Expected 0.0, actual 16769025 (2197949775808.02x) (16769025) !!!!!
Testing collisions (high 32-bit) - Expected 32725.4, actual 16769025 (512.42x) (16736300) !!!!!
Testing collisions (high 27-42 bits) - Worst is 42 bits: 16769025/31 (524032.73x) !!!!!
Testing collisions (low 32-bit) - Expected 32725.4, actual 16769025 (512.42x) (16736300) !!!!!
Testing collisions (low 27-42 bits) - Worst is 42 bits: 16769025/31 (524032.73x) !!!!!
Testing AV variant, 128 count with 4 spacing, 4-12:
Testing collisions ( 64-bit) - Expected 0.0, actual 674604 (2595263322583.69x) (674604)
Testing collisions (high 32-bit) - Expected 1116.2, actual 675248 (604.98x) (674132) !!!!!
Testing collisions (high 25-37 bits) - Worst is 37 bits: 674627/34 (19337.02x) !!!!!
Testing collisions (low 32-bit) - Expected 1116.2, actual 675267 (605.00x) (674151) !!!!!
Testing collisions (low 25-37 bits) - Worst is 37 bits: 674622/34 (19336.88x) !!!!!
Verification value 0x59CBE0F0 ....... PASS
Average - 2.425 bytes/cycle - 6937.08 MiB/sec @ 3 ghz
Average 35.108 cycles/hash
Verification value is 0x00000001 - Testing took 1169.231000 seconds
Failures: Permutation, Window
Combination 128-bytes [0-last] Tests:
Keyset 'Combination' - up to 22 blocks from a set of 2 - 8388606 keys
Testing collisions ( 32-bit) - Expected 8186.7, actual 4065412 (496.59x) (4057226) !!!!!
Window at 58 - Testing collisions ( 32-bit) - Expected 128.0, actual 375 (2.93x) (248) !!!!!
Window at 59 - Testing collisions ( 32-bit) - Expected 128.0, actual 687 (5.37x) (560) !!!!!
Window at 60 - Testing collisions ( 32-bit) - Expected 128.0, actual 612 (4.78x) (485) !!!!!
Window at 61 - Testing collisions ( 32-bit) - Expected 128.0, actual 869 (6.79x) (742) !!!!!
Window at 62 - Testing collisions ( 32-bit) - Expected 128.0, actual 1000 (7.81x) (873) !!!!!
Window at 63 - Testing collisions ( 32-bit) - Expected 128.0, actual 1133 (8.85x) (1006) !!!!!
Window at 64 - Testing collisions ( 32-bit) - Expected 128.0, actual 1305 (10.20x) (1178) !!!!!
Window at 65 - Testing collisions ( 32-bit) - Expected 128.0, actual 831 (6.49x) (704) !!!!!
Window at 66 - Testing collisions ( 32-bit) - Expected 128.0, actual 488 (3.81x) (361) !!!!!
Window at 67 - Testing collisions ( 32-bit) - Expected 128.0, actual 308 (2.41x) (181) !!!!!
Verification value 0x8C944400 ....... PASS
Average - 3.706 bytes/cycle - 10602.00 MiB/sec @ 3 ghz
Average 32.018 cycles/hash
Verification value is 0x00000001 - Testing took 1157.755000 seconds
No test failures.
Verification value 0xD83303AC ....... PASS
Average - 0.460 bytes/cycle - 1315.91 MiB/sec @ 3 ghz
Average 64.906 cycles/hash
Verification value is 0x00000001 - Testing took 1301.033000 seconds
Failures: nearly all.
Results from running the
unordered_flat.cpp
benchmark
from Boost.ContainerHash, modified to test the mulxp
family of functions
in addition to boost::hash<std::string>
and absl::Hash<std::string>
.
boost::hash: 14990 ms
absl::Hash: 12352 ms
mulxp0_hash: 11533 ms
mulxp1_hash: 11716 ms
mulxp2_hash: 11800 ms
mulxp3_hash: 11101 ms
boost::hash: 17352 ms
absl::Hash: 14732 ms
mulxp0_hash: 13479 ms
mulxp1_hash: 13589 ms
mulxp2_hash: 14673 ms
mulxp3_hash: 13962 ms
boost::hash: 19053 ms
absl::Hash: 16200 ms
mulxp0_hash: 13496 ms
mulxp1_hash: 13651 ms
mulxp2_hash: 14072 ms
mulxp3_hash: 13213 ms
boost::hash: 17056 ms
fnv1a_hash: 16340 ms
absl::Hash: 19995 ms
mulxp1_hash32: 16282 ms
mulxp3_hash32: 16413 ms
boost::hash: 18677 ms
fnv1a_hash: 17226 ms
absl::Hash: 20264 ms
mulxp1_hash32: 17201 ms
mulxp3_hash32: 15195 ms