/umash

UMASH: a fast enough hash and fingerprint with collision bounds

Primary LanguagePythonMIT LicenseMIT

UMASH: a fast almost universal 64-bit string hash

amd64 aarch64

STATUS: the hash and fingerprint algorithms are finalized, and so is the mapping from umash_params_derive inputs to UMASH parameters. However, the ABI is not finalized; in particular, passing random bytes to umash_params_prepare may still result in different parameters.

UMASH is a string hash function with throughput (10.9 byte/cycle, or 39.5 GiB/s on an EPYC 7713) and latency (24 to 48 cycles for input sizes up to 64 bytes on the same machine) comparable to that of contemporary performance-optimised hashes like XXH3, HalftimeHash, or MeowHash (or aHash for 🦀 coders). Its 64-bit output is almost universal, and it, as well as both its 32-bit halves, passes both Reini Urban's fork of SMHasher and Yves Orton's extended version (after expanding each seed to a 320-byte key for the latter).

This C library has also been ported to little-endian aarch64 with the crypto extensions (-march=armv8-a+crypto). On the Apple M1's 3.2 GHz performance cores, the port computes the same function as the x86-64 implementation, at a peak throughput of 16 byte/cycle (49.2 GiB/s), and 30 to 52 cycle latency for short input up to 64 bytes.

Unlike most other non-cryptographic hash functions (CLHash and HalftimeHash are rare exceptions) which do not prevent seed-independent collisions and thus usually suffer from such weaknesses, UMASH provably avoids parameter-independent collisions. For any two inputs of s bytes or fewer, the probability that a randomly parameterised UMASH assigns them the same 64 bit hash value is less than ceil(s / 4096) 2**-55.

UMASH also offers a fingerprinting function that computes a second 64-bit hash concurrently with the regular UMASH value. That function's throughput (7.5 byte/cycle, 25.8 GiB/s on an EPYC 7713) and latency (37 to 74 cycles for inputs sizes up to 64 bytes on the same machine) comparable to that of classic hash functions like MurmurHash3 or farmhash. Combining the two hashes yields a 128-bit fingerprint that collides pairs of s-or-fewer-byte inputs with probability less than ceil(s / 2**26)**2 * 2**-83; that's less than 2**-70 (1e-21) for up to 5 GiB of data.

See umash_reference.py (pre-rendered in umash.pdf) for details and rationale about the design, and a proof sketch for the collision bound. The blog post announcing UMASH, and this other post on the updated fingerprinting algorithm include higher level overviews and may provide useful context.

If you're not into details, you can also just copy umash.c and umash.h in your project: they're distributed under the MIT license. For extra speed (at the expense of code size) add umash_long.inc as well, also distributed under the MIT license.

The current implementation only build with gcc-compatible compilers that support the integer overflow builtins introduced by GCC 5 (April 2015) and targets x86-64 machines with the CLMUL extension (available since 2011 on Intel and AMD), or aarch64 with the "crypto" extension (for PMULL).

Quick start

Here's how to use UMASH for a simple batch hash or fingerprint computation.

First, we need to generate struct umash_params that will define the parameters ("key") of the UMASH hash or fingerprint function.

For a hashing use case, one could fill a struct umash_params params with random bits (e.g., with a getrandom(2) syscall), and call umash_params_prepare(&params) to convert the random bits into a valid key. This last call may fail by returning false; however, the probability of that happening are astronomically small (less than 2**-100) if the input data is actually uniformly random.

Fingerprinting often needs a deterministic set of parameters that will be preserved across program invocations. For that use case, one should either fill a struct umash_params with hardcoded random contents before calling umash_params_prepare, or use umash_params_derive to deterministically generate an unpredictable set of parameters from a 64-bit value and a 32-byte secret.

For a fingerprinting use case, each program should use its own 32-byte secret.

Given a fully initialised struct umash_params params, we can now call umash_full or umash_fprint to hash or fingerprint a sequence of bytes. The seed argument is orthogonal to the collision bounds, but may be used to get different values, e.g., when growing a hash table afer too many collisions. The fingerprint returned by umash_fprint is simply an array of two hash values. We can compute either of these 64-bit hash values by calling umash_full: letting which = 0 computes the first hash value in the fingerprint, and which = 1 computes the second. In practice, computing the second hash value is as slow as computing a full fingerprint, so that's rarely a good option.

See example.c for a quick example.

$ cc -O2 -W -Wall example.c umash.c -mpclmul -o example
$ ./example "the quick brown fox"
Input: the quick brown fox
Fingerprint: 398c5bb5cc113d03, 3a52693519575aba
Hash 0: 398c5bb5cc113d03
Hash 1: 3a52693519575aba

We can confirm that the parameters are constructed deterministically, and that calling umash_full with which = 0 or which = 1 gets us the two halves of the umash_fprint fingerprint.

Hacking on UMASH

The test suite calls into a shared object with test-only external symbols with Python 3, CFFI, and Hypothesis. As long as Python3 and venv are installed, you may execute t/run-tests.sh to download test dependencies, build the current version of UMASH and run all the pytests in the t/ directory. t/run-tests-public.sh only exercises the public interface, which may be helpful to test a production build or when making extensive internal changes.

The Python test code is automatically formatted with black. We try to make sure the C code sticks to something similar to the FreeBSD KNF; when in doubt, whatever t/format.sh does is good enough.

We are also setting up Jupyter notebooks to make it easier to compare different implementations, to visualise the results, and to automatically run a set of statistical tests on that data. See bench/README.md for more details.

Help wanted

The UMASH function is now frozen, but the implementation isn't. In addition to regular maintenance and portability work, we are open to expanding the library's capabilities. For example:

  1. We currently only use incremental and one-shot hashing interfaces. If someone needs parallel hashing, we can collaborate to find out what that interface should look like.
  2. How fast could we go on a GPU?