best string hash function
ktprime opened this issue · 4 comments
the best string hash function for hash map.
https://github.com/wangyi-fudan/wyhash
from benchmark your str win is because fixed size string(no heap alloc -> cache friendly) and opt string compare
I add my emhash7(git emhash) into your integer bench:
bench_hash 0 sum: 0 avg lat: 4.8297
bench_hash 1 sum: 0 avg lat: 4.69109
bench_hash 2 sum: 0 avg lat: 3.46806
bench_hash 3 sum: 0 avg lat: 3.35624
bench_hash 4 sum: 0 avg lat: 4.9344
bench_hash 5 sum: 0 avg lat: 1.73721
bench_hash 6 sum: 0 avg lat: 1.64723
std::__1::map<unsigned int, unsigned short>, sum: 2893000 avg lat: 4.69555
std::__1::unordered_map<unsigned int, unsigned short>, sum: 2893000 avg lat: 7.24729
tsl::robin_map<unsigned int, unsigned short, std::__1::hash, std::__1::equal_to, std::__1::allocator<std::__1::pair<unsigned int, unsigned short>>, true>, sum: 2893000 avg lat: 1.14586
tsl::hopscotch_map<unsigned int, unsigned short, std::__1::hash, std::__1::equal_to, std::__1::allocator<std::__1::pair<unsigned int, unsigned short>>, 10, true>, sum: 2893000 avg lat: 3.34826
robin_hood::detail::Table<true, 80, unsigned int, unsigned short, robin_hood::hash, std::__1::equal_to>, sum: 2893000 avg lat: 3.61231
emhash7::HashMap<unsigned int, unsigned short>, sum: 2893000 avg lat: 1.54262
dense_hash_set, sum: 2893000 avg lat: 7.42706
bench_bsearch sum: 2893000 avg lat: 4.78543
huangyuingdeMBP:benchmark huangyuanbing$ ./benchfindstr < data.txt
bench_hash 0 sum: 44301000 avg lat: 5.00236
bench_hash 1 sum: 44301000 avg lat: 5.60526
bench_hash 2 sum: 44301000 avg lat: 4.78484
bench_hash 3 sum: 44301000 avg lat: 5.33511
bench_hash 4 sum: 44301000 avg lat: 5.94543
bench_hash 5 sum: 44301000 avg lat: 4.34096
bench_map sum: 44301000 avg lat: 21.6678
std::__1::unordered_map<std::__1::basic_string, unsigned short>, sum: 44301000 avg lat: 25.7266
tsl::robin_map<std::__1::basic_string, unsigned short, std::__1::hash<std::__1::basic_string>, std::__1::equal_to<std::__1::basic_string>, std::__1::allocator<std::__1::pair<std::__1::basic_string, unsigned short>>, true>, sum: 44301000 avg lat: 11.5332
tsl::hopscotch_map<std::__1::basic_string, unsigned short, std::__1::hash<std::__1::basic_string>, std::__1::equal_to<std::__1::basic_string>, std::__1::allocator<std::__1::pair<std::__1::basic_string, unsigned short>>, 10, true>, sum: 44301000 avg lat: 15.4038
robin_hood::detail::Table<true, 80, std::__1::basic_string, unsigned short, robin_hood::hash<std::__1::basic_string>, std::__1::equal_to<std::__1::basic_string>>, sum: 44301000 avg lat: 13.7637
emhash7::HashMap<std::__1::basic_string, unsigned short>, sum: 44301000 avg lat: 13.4242
emhash8::HashMap<std::__1::basic_string, unsigned short>, sum: 44301000 avg lat: 11.4544
emhash5::HashMap<std::__1::basic_string, unsigned short>, sum: 44301000 avg lat: 13.92
dense_hash_map, sum: 44301000 avg lat: 22.3428
bench_bsearch sum: 44301000 avg lat: 16.4077
bench_string_bsearch sum: 44301000 avg lat: 51.6053
Actually there's a little bug in benchfindint.cc where I recently changed IntT type but forgot to change IntLen accordingly. I have fixed it and submitted, now bench_hash<6> should be the fastest.
I have clone your project(https://github.com/ktprime/str) and add my emhash into bench code(rand input),
your solution is not alway fastest if input very large(n1 > 30000).
huangyuingdeMBP:benchmark huangyuanbing$ ./benchfindint 40001 123456
n1 = 40001
n2 = 123456
bench_hash 6 sum: 202587583000 avg lat: 11.9826
std::__1::unordered_map<unsigned short, unsigned short>, sum: 11272000 avg lat: 24.4841 factor 0.999466
tsl::robin_map<unsigned short, unsigned short, std::__1::hash, std::__1::equal_to, std::__1::allocator<std::__1::pair<unsigned short, unsigned short>>, true>, sum: 11272000 avg lat: 5.50371 factor 0.456711
tsl::hopscotch_map<unsigned short, unsigned short, std::__1::hash, std::__1::equal_to, std::__1::allocator<std::__1::pair<unsigned short, unsigned short>>, 10, true>, sum: 11272000 avg lat: 5.18199 factor 0.456711
robin_hood::detail::Table<true, 80, unsigned short, unsigned short, robin_hood::hash, std::__1::equal_to>, sum: 11272000 avg lat: 7.82956 factor 0.456711
emhash5::HashMap<unsigned short, unsigned short>, sum: 11272000 avg lat: 5.60857 factor 0.456711
emhash7::HashMap<unsigned short, unsigned short>, sum: 11272000 avg lat: 4.41366 factor 0.456711
emhash8::HashMap<unsigned short, unsigned short>, sum: 11272000 avg lat: 5.48717 factor 0.456711
dense_hash_set, sum: 202587583000 avg lat: 7.35589
bench_bsearch sum: 220819124800 avg lat: 70.9841
hyb@HYGPC171201001:/mnt/d/str/benchmark$ ./benchfindint 1001 100000
n1 = 1001
n2 = 100000
bench_hash 0 sum: 153361400 avg lat: 6.16645
bench_hash 1 sum: 153361400 avg lat: 5.96371
bench_hash 2 sum: 153361400 avg lat: 6.63877
bench_hash 3 sum: 153361400 avg lat: 6.28237
bench_hash 4 sum: 153361400 avg lat: 8.34494
bench_hash 5 sum: 153361400 avg lat: 6.3268
bench_hash 6 sum: 153361400 avg lat: 4.50323
std::unordered_map<short unsigned int, short unsigned int>, sum: 302200 avg lat: 16.8496 factor 0.967992
tsl::robin_map<short unsigned int, short unsigned int>, sum: 302200 avg lat: 8.63683 factor 0.487305
tsl::hopscotch_map<short unsigned int, short unsigned int>, sum: 302200 avg lat: 7.28711 factor 0.487305
robin_hood::detail::Table<true, 80, short unsigned int, short unsigned int, robin_hood::hash, std::equal_to >, sum: 302200 avg lat: 3.91013 factor 0.487305
phmap::flat_hash_map<short unsigned int, short unsigned int>, sum: 302200 avg lat: 3.2078 factor 0.487543
emhash5::HashMap<short unsigned int, short unsigned int>, sum: 302200 avg lat: 7.19306 factor 0.487305
emhash7::HashMap<short unsigned int, short unsigned int>, sum: 302200 avg lat: 6.95231 factor 0.487305
emhash8::HashMap<short unsigned int, short unsigned int>, sum: 302200 avg lat: 6.8073 factor 0.487305
dense_hash_set, sum: 153361400 avg lat: 11.1664
small n1 phmap(absl::flat_hash_map) is the fastest.