Consider shifting `left`over bits left before xoring with unshifted length onto v3
jberryman opened this issue · 2 comments
Here we shift the length left, truncating it to 8 bits:
https://github.com/veorq/SipHash/blob/master/siphash.c#L92
And here we combine it with any leftover bytes before xoring into v3
:
https://github.com/veorq/SipHash/blob/master/siphash.c#L112
Instead what I'd like to do is xor
the full unshifted inlen
onto v3
and then xor
the remaining (less than 8) bytes shifted all the way left. This would also mean for large inlen
the length and remaining bits overlap somewhat and interact via xor
, although we could first mask it if that were preferable for some reason.
My use case is my siphash implementation's inner loop is working on a stream of chunks of bytes of size 1, 2, 4, or 8. We don't know the length in advance, and accumulate it as we go, and don't know when we've received the last chunk (our "inner loop" is a function passed to a fold over chunks of bytes, in FP parlance).
Currently we have to carry around an extra word for accumulating enough bits to run through cROUNDS
siphash rounds:
https://github.com/jberryman/hashabler/blob/master/src/Data/Hashabler/SipHash.hs#L195
If the algorithm were modified as I suggest, we could use v3
to accumulate bytes as we consume them, just shifting the input chunks left as needed to align them and xor
ing them onto v3
.
So for now this is mostly a question: does such a modification sound reasonable, or do you think it's possible that this could negatively affect the properties of the hash function (by moving the length to the lower order bits)?
(I'm sorry not to include code or a PR; to be honest looking back at the C implementation again after a year I'm having trouble untangling pointer arithmetic and endian issues. It's possible my description above was also backwards... But modulo endianness my implementation passes using your test vectors)
If I understand correctly, what you propose is equivalent to different encoding of the input's length. So I don't think it would affect security in any way. But with this change you'd get different output values (for most of them), so it wouldn't match SipHash's test vectors. I'd like to keep the algorithm backward compatible as much as possible, so I don't think we'll make this change. But I'd be happy to review your custom version.