marbl/Mash

Change default murmurhash seed and/or expose as option?

Closed this issue · 5 comments

While I appreciate the Douglas Adams tribute, I wanted to raise the possibility of (1) changing the default murmurhash seed value to 0 (default in many bindings) and/or (2) exposing it as a command line flag.

The former would be useful for compatibility (both cross lang/lib, and with existing k-mer hashing code that uses murmurhash3). The latter may also be useful in facilitating easier simulation of many pseudorandom minmer subsets from the same file, and for interop with code that doesn't use a seed of 42.

Obviously (1) is a highly breaking change and therefor good to discuss and pin down prior to a preprint coming out. Thoughts on the marbl side?

Hi – quickly following up here. @ondovb + @aphillippy any thoughts on this?

A 3rd, 100% backwards compatible option is to just store the seed in the Cap'n Proto file so at the very least it's explicit vs. an undocumented implementation detail.

And hope this isn't annoying to re-raise – just trying to ensure that interop with Mash and other sketches will be simple in the future.

Option 3 would be non-invasive but still pave the way for the other options in the future; I don't see any reason not add that in a future release, but the first release is pretty much locked down. Adam?

The problem goes beyond the seed. Really you'd want to store a full description or identifier of the whole hash function (seed included). E.g. there's no guarantee Mash will forever use murmur3hash. So I'm in favor of exposing alternate hash functions as advanced options, but we need to think how to capture that in the sketch in a general way. Maybe pick a magic string that gets hashed and the resulting hash value is the identifier for the hash function?

ctb commented

This is important for #27! I like the magic string approach as a verification, in addition to having something like a simple documented table of hash functions. And yes, I had to dig into the source code to figure out why my hash functions were returning different values :).

The seed is now store and parameterized in v2.0. As mentioned in the release notes, there is a slight caveat here in that we had to ensure old versions would hard fail with a non-legacy seed, as they were not equipped to query it. Also for this reason, we kept the default as the legacy seed (42) for now to ensure maximum cross-version compatibility for typical users. This may change in the future as adoption of seed-aware versions increases.