BIDData/BIDMat

hashing in BIDMat

huitseeker opened this issue · 1 comments

Grepping for hash instances in the project:

huitseeker@jollyjumper:~/tmp/BIDMat(master)$ ag Murmur -G'.*\.scala'
src/main/scala/BIDMat/DMat.scala
6:import scala.util.hashing.MurmurHash3
51:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/FMat.scala
7:import scala.util.hashing.MurmurHash3
49:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/GDMat.scala
16:import scala.util.hashing.MurmurHash3
46:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/GIMat.scala
13:import scala.util.hashing.MurmurHash3
45:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/GLMat.scala
10:import scala.util.hashing.MurmurHash3
49:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/GMat.scala
13:import scala.util.hashing.MurmurHash3
43:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/IMat.scala
5:import scala.util.hashing.MurmurHash3
42:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/LMat.scala
5:import scala.util.hashing.MurmurHash3
42:     out.setGUID(MurmurHash3.mix(MurmurHash3.mix(nr, nc), (GUID*3145341).toInt));

src/main/scala/BIDMat/ND.scala
11:import edu.berkeley.bid.MurmurHash3
199:    MurmurHash3.MurmurHash3_x64_64(inds.map(_.GUID), 0x3142341)
203:    MurmurHash3.MurmurHash3_x64_64(inds.map(_.toLong), 0x3142341)
huitseeker@jollyjumper:~/tmp/BIDMat(master)$

I've found a lot of hashes on 32-bit Ints. Have you considered replacing Murmur3 with the faster xxHash ?
https://github.com/Cyan4973/xxHash
(as for the one usage in its 64-bit version, note the existence of XXH64)

THose hashes are used to build GUIDs for matrices which are being cached. So there is always a matrix operation behind them which should take a lot more time than the hash itself. But if you have some evidence that its slowing things down let us know. Minimizing the collision probability seems like the most important thing.