Critics of t1ha-crc hash
Bulat-Ziganshin opened this issue · 6 comments
moved from rurban/smhasher#30
i don't recommend to use this crc-based hash (or any automatic selection of fastest hash as it's currently implemented in this package). while basic t1ha hashes (w/o crc/aes) look reasonable, crc-based hash is absolutely non-sense, made only for speed records rather than real work. in short, it computes 32-bit hash of each 64-bit input word, and then produces 64-bit hash based on these stripped numbers
@leo-yuriev, it will be great if you will finally remove this pseudo-hash from your package. Passing SMHasher doesn't guarantee that your hash is strong, you need also to understand how to change smhasher to make your hash fail. Your idea that users should understand CRC math and research each hash they are using yourself, is absolutely laughable - real users are much less educated even than you. And finally, idea to mix weak crc hash with reasonable main t1ha hash with automatic hash selection based solely on speed, deceives these users by pretending them that they are using reasomnnable main hash while automatic selection actually chooses crc hash in most cases. You are started with good idea, but then lost sense trying to setup meaningless records and relying on automatic smhasher testing rather than understanding of your own algorithms.
t1ha-crc hash is weaker than xxhash and so on. essentially, it's a crc32 hash arttifically widened to 64 bits, and you know that even crc32 can't pass half of smhasher tests. t1ha-crc weakness can be detected by the various smhasher tests checking 40-64 bit keys IF these keys will be preceded by fixed 40-byte header
Ok, I see now what you meant. Agreed.
@Bulat-Ziganshin, inform that I agree with your suggestion to remove t1ha_ia32crc().
I ran some tests and also came to the conclusion that in the case CRC32C a performance boost does not justify the quality problems.
Also I tried to add more injections points. This solves quality problems, but with substantial slowdown. So, seems that CRC32C is useless for any 64-bit checksum/hash (RIP).
@leo-yuriev Are you able to formalize your quality tests? Like bash scripts or such?
There are many low-quality hash functions in smhasher, they just have be labelled as such.
@rurban, such test is simple/obviously and reasonable for the t1ha-crc, but not for other hashes.
We can seen here that for a keys, longer than the 32 bytes, only half of entropy will be taken. It is obviously, because _mm_crc32_u64() returns only 32 bits from 64 sourcing.
Therefore, for example:
- let we have the key length of 33 bytes;
- let iterate all value of the second 64-bit word inside it and compute hash;
- this gives the 2^64 different values of key, but only the 2^32 values of hash.
Again, it's been obvious initially, but could be discussed the tradeoff between performance gain and a loss of quality.
My last test is about how bad this in some real applications and once again to weigh a tradeoffs.
As a result, I came to the conclusion that the "ratio" a loss of quality to a gain of performance is unjustified here.
@rurban, in case of SMHasher we could add a testcase (as explained above) which will break all CRC-based hashes.
It is no required to iterate all 2^64 values, but N*2^16 is required.
In other words just make a https://en.wikipedia.org/wiki/Birthday_attack, via the second 64-bit word of key (for t1ha-crc case).
Just crc hash it twice then. The upper and lower half. int shifts are trivial.
Adding birthday attacks are planned with rurban/smhasher#20