veorq/SipHash

Consider shifting `left`over bits left before xoring with unshifted length onto v3

jberryman opened this issue · 2 comments

Here we shift the length left, truncating it to 8 bits:

https://github.com/veorq/SipHash/blob/master/siphash.c#L92

And here we combine it with any leftover bytes before xoring into v3:

https://github.com/veorq/SipHash/blob/master/siphash.c#L112

Instead what I'd like to do is xor the full unshifted inlen onto v3 and then xor the remaining (less than 8) bytes shifted all the way left. This would also mean for large inlen the length and remaining bits overlap somewhat and interact via xor, although we could first mask it if that were preferable for some reason.

My use case is my siphash implementation's inner loop is working on a stream of chunks of bytes of size 1, 2, 4, or 8. We don't know the length in advance, and accumulate it as we go, and don't know when we've received the last chunk (our "inner loop" is a function passed to a fold over chunks of bytes, in FP parlance).

Currently we have to carry around an extra word for accumulating enough bits to run through cROUNDS siphash rounds:

https://github.com/jberryman/hashabler/blob/master/src/Data/Hashabler/SipHash.hs#L195

If the algorithm were modified as I suggest, we could use v3 to accumulate bytes as we consume them, just shifting the input chunks left as needed to align them and xoring them onto v3.

So for now this is mostly a question: does such a modification sound reasonable, or do you think it's possible that this could negatively affect the properties of the hash function (by moving the length to the lower order bits)?

(I'm sorry not to include code or a PR; to be honest looking back at the C implementation again after a year I'm having trouble untangling pointer arithmetic and endian issues. It's possible my description above was also backwards... But modulo endianness my implementation passes using your test vectors)

veorq commented

If I understand correctly, what you propose is equivalent to different encoding of the input's length. So I don't think it would affect security in any way. But with this change you'd get different output values (for most of them), so it wouldn't match SipHash's test vectors. I'd like to keep the algorithm backward compatible as much as possible, so I don't think we'll make this change. But I'd be happy to review your custom version.

@veorq I appreciate that, thanks!