erthink/t1ha

Notes about t1ha_v3 design

erthink opened this issue ยท 6 comments

  1. Dirty and Fair kinds:
    • t1ha3_dirty = favour speed over quality: non-uniform, all hardware tricks (ie. AES-NI), one injection point and blinding multiplication are allowed;
    • t1ha3_fair = favour quality over speed: two injection points, no known drawbacks.
  2. Goals:
    • t1ha3_dirty: should be faster than competitors (e.g. xxHash_v3, MeowHash and wyhash_v777), while not inferior in quality; Or at least the same speed with better quality (i.e. no several drawbacks).
    • t1ha3_fair: proven excellence in the quality of competitors, with minimal performance degradation.
    • target architectures in order of priority: E2K, AMD64, AARCH64, RISC-V, MIPS64, MIPS32, ARM (no Thumb, but Thumb-2).
    • optional: additional variations of speed/quality/security tradeoffs.
  3. Endianness: No explicit LE/BE versions:
    • t1ha3_dirty: the native byte order is always used;
    • t1ha3_fair: Little Endian byte order is primary, bswap on BE platforms.
  4. Bitness:
    • result: 32, 64, 128.
    • implementations: 64, 32.
  5. Insights:
    • avoid MUM-mixer;
    • single header-only C99 implementation;
    • avoid unaligned read;
    • avoid unsafe user options;
    • SIMD-frendly;
    • non-injective mixers.

Schedule:

  • Most likely, only the final version will be published when there is no doubt about achieving the goals.
  • No specific deadlines are publicly declared. Sponsor the development if you need any guarantees.

congraduations! my Russian friend!

More details:

  1. I finally convinced myself to give up the Vladimir Markov's mum-mixer:
    • With good mixing, it is not uniform, loses entropy, and makes difficult to analyze the quality of hashing
    • None platforms can effectively vectorize it;
    • On all target platforms except AMD64, an additional "high-part" multiplication instruction is required.
    • Moreover, I got information (but don't ask me, please) that allows me to conclude that in all future generations of processors, the wide multiplication will be performed slightly slower (i.e. x86), and on architectures with a separate instruction (ARM, MIPS), multiplication will always start from scratch (without trying to use the result of the previous narrow multiplication). This is the complete opposite of knowing a few years ago. So, no mul_64x64_128() at all.
  2. UMAC approach but with two, three or four data-injections (or a Toeplitz construction) seems good enough for t1ha_v3:
    • Of course, will have to pay for this by speed, but it will allow to provide a proven level of quality, and support for a secret salt with an attack complexity of more than 2^64..128 (more analysis is required).
    • Most likely, the state will be based on the PRNG-kernel and have a bitness of 2-3-4 times greater than the result. This works great for E2K right now, but Haswell isn't doing so well.
    • Nonetheless, for the "dirty" kind of function I plan use Bulat Ziganshin's FARSH approach. I think this will allows get portable vectorizable code that runs faster than Wang Yi's wyhash, with slightly better quality (i.e. without mum-mixer flaws).
  3. The main difference from Yann Collet's XXH3 and Bulat Ziganshin's FARSH, seems, will be using only 64-bit operations, even multiplication:
    • This applies only to the "fair" 64-bit kinds of the function, but not to the "dirty" kind or 32-bit versions.
    • For this reason, the function will only run at full speed if all vectorized operations for 64-bit operands are available, including multiplication. This is excellent for E2K and x86 with AVX512, but consi-consa for AVX2/Neon, and bad for SSE2.
    • However, this solution seems reasonable for me, since it allows you to fully use the features of E2K and AVX512, and on AVX2 and Neon just works not bad. In any case, I don't think I should do an analog of XXH3, but I should move up.
  4. The 512-bit block size. There is no reason to do otherwise.
  5. Carefully selected mixers for short data. There will also be a set of fundamental differences from XXH3, but it's too early to put cards on the table.

E2K and x86 with AVX512

No! ๐Ÿ™…โ€โ™‚๏ธ

By targeting that, you support:

  • An obsure architecture that is only relevant in Russia. I, like many others, haven't seen or heard of E2K until you linked that article.
    • If I am not mistaken, these also run x86_64, no?
  • Intel chips designed since 2015, which is not a large target โ€” average consumers don't upgrade computers until they break or after ~10 years.
    • Not many programmers care about AVX512 โ€” it is mostly a gimmick. You will find most programmers (and AMD!) not going any higher than AVX2.
    • vpmullq has 15 cycles of latency on Skylake-X. It is faster to do 4 individual imulq instructions and use ILP, unless you are in a scenario where the cost is made up by the efficiency of the other operations.

AVX2, Neon, SSE2

Better.

By targeting this, you support:

  • All aarch64 devices
  • All iOS devices (except first gen iPhone which is more than irrelevant)
  • >99% of all Android devices (Some odd 32-bit ones lacked NEON)
  • Raspberry Pi
  • Every x86_64 processor
  • All x86 processors since Pentium 4

As for SSE, if you don't want to go as low as SSE2, I suggest SSE4.1, as that targets the important Core 2 Duo which is still found in the consumer market. You still need a compatibility check for that, as you can still run into the occasional Conroe or Pentium 4.

Target the present, stay compatible with the past, and leave room for the future.

You don't have to go so far as @Cyan4973 and I went with XXH3 (we wanted to keep the minimum close to XXH32), but keep in mind that the consumer will not be using the latest and greatest (or know what the most optimized program for their CPU is, instead downloading the 32-bit version because "it works")

Programmers like things they can just drop in without worrying about compatibility/slowing down.

Moreover, I got information (but don't ask me, please) that allows me to conclude that in all future generations of processors, the wide multiplication will be performed slightly slower (i.e. x86), and on architectures with a separate instruction (ARM, MIPS), multiplication will always start from scratch (without trying to use the result of the previous narrow multiplication). This is the complete opposite of knowing a few years ago. So, no mul_64x64_128() at all.

That is true, which is why XXH3 only uses it in medium lengths, as these multiplies cost 11 cycles on aarch64, 12 cycles on ARMv6, and are pretty expensive on every other 32-bit architecture.

As for aarch64, the cycle counts suggest that it only has a hardware 32->64 multiply and it emulates the full multiplies the same way it would be done on 32-bit in micro-ops.

That is why XXH64 is rather mediocre on it, only getting 3080 MB/s on the xxHash bench, compared to 3100 MB/s on XXH32 and 5800 MB/s on XXH3.

@easyaspi314, thank for feedback!

I'm doing a very different job now, but I think reasonable to clarify some aspect:

  • XXH3 by Yann (@Cyan4973) is perfect enough for now, i.e. we don't need a clone of XXH3.
  • E2K will be my main target platform for foreseeable future for a number of reasons. The most important thing is that this isa really serious technology claim, since a prototypes are significantly ahead of Intel/AMD (adjusted to the silicon lithography, which does not stand a chance without "proton light"). For clarity E2K could "run" x86 code like a Valgrind, just for compatibility for existing software.
  • AVX512 is not an objective, but a minimal basis corresponding to next E2K.

Wow, (magically) my t1ha_v3 draft have a lot in common with SHISHUA by @espadrine.
If things go on like this, there's nothing left for me.

gzm55 commented

newton and leibniz are both greatest mathematicians who independently invented calculus. look towards for your great work~