RFC: How should a digest_size > hash function's output be handled?
How should a multihash implementation handle a call where the digest is longer than the specified output of that hash function? E.g. (pseudocode)
// The provided digest is 26 bytes long,
// but SHA1 can't produce a digest longer than 20 bytes
mh = multihash.encode(digest='abcdefghijklmnopqrstuvwxyz', code=Codes['sha1'])
Conversely, is the following multihash valid? (hex encoded, spaces added for legibility)
11 1a 6162636465666768696a6b6c6d6e6f707172737475767778797a
The code field declares the hash function is SHA-1 (0x11). The length field declares the digest is 26 bytes long, and the received digest field is indeed 26 bytes long. However, SHA-1 can't produce a 26-byte digest.
When a multihash library is called to encode/decode such an input it could:
a. Signal an error, and stop encoding/decoding?
b. Set the length field to len(digest), then continue with processing?
c. Truncate the digest field, before continuing with processing?
The behaviour is currently unspecified, i.e. implementations can do whatever they want. AFAICT go-multihash does b. I'd like to propose a. as the required behaviour for standards-conforming implementations.
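For concreteness, here's a minimal sketch of option a. in Go, assuming single-byte code and length fields as in the hex example above. The table and Encode function are illustrative only, not go-multihash's actual API:

package multihash

import "fmt"

// Illustrative table of maximum digest lengths, keyed by multihash code
// (0x11 = sha1, 0x12 = sha2-256).
var digestLengths = map[uint8]int{
	0x11: 20, // sha1
	0x12: 32, // sha2-256
}

// Encode implements option a: it refuses to pack a digest that is longer
// than the named hash function can produce.
func Encode(digest []byte, code uint8) ([]byte, error) {
	maxLen, ok := digestLengths[code]
	if !ok {
		return nil, fmt.Errorf("unknown multihash code 0x%x", code)
	}
	if len(digest) > maxLen {
		return nil, fmt.Errorf("digest is %d bytes, but 0x%x produces at most %d", len(digest), code, maxLen)
	}
	return append([]byte{code, uint8(len(digest))}, digest...), nil
}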
What do you think? Have I missed a specification document? Would you prefer I sent this to a mailing list (e.g. ipfs-users)?
Thanks for this, great bug catch. I think this should probably be an error. Though we could define all extra bytes to always be zeros; that way we can compare sha1 and sha256 at the same lengths or something.
we could define all extra bytes to always be zeros; that way we can compare sha1 and sha256 at the same lengths or something
I don't follow. What would be the use case for comparing digests from two different algorithms?
Kademlia DHT for XOR distance between digests
How niche (or otherwise) are those use cases? Would you really use a multihash-encoded digest directly in a DHT, given the biasing that occurs in the first two bytes?
I'm worried that if padding bytes are introduced then it will lead to some vulnerability or error based on a false sense of security. I would very much prefer that behaviour a. (i.e. abort/signal an error) be the only conforming behaviour, or at least the strongly encouraged default behaviour.
If padding is to be allowed by the standard it should be an explicit choice, such as passing a flag padding=true, or invoking encode_padded() instead of encode().
How niche (or otherwise) are those use cases?
It's not uncommon.
I'm not sure either way, yet. But I hear you re errors and not creating a misleading level of bit security. That's likely the right answer.
+1 with @moreati for returning an error when length > length_of_digest_algorithm, both when computing a Sum() (as the utility function is called in the go-multihash implementation) and when packing a digest into a multihash value.
Padding should be handled by the application that requires comparison of hashes of different lengths.
If indeed padding is what's needed. For instance, Kademlia implementations could decide to:
- rehash all multihashes in order to get a uniform hash length for keying and computing distances.
- pad a hash with itself instead of with zeroes (see the sketch after this list). This would not add entropy, but would avoid clustering of shorter hashes.
- etc.
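A minimal sketch of the "pad a hash with itself" idea, assuming it lives in the application rather than the multihash library (the function name is made up):

// padWithSelf repeats the digest until the buffer reaches the target length.
// As noted above this adds no entropy, but it avoids every short hash
// ending in the same long run of identical padding bytes.
func padWithSelf(digest []byte, target int) []byte {
	out := make([]byte, target)
	for i := range out {
		out[i] = digest[i%len(digest)]
	}
	return out
}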
Another option is to actually produce that many bytes of hash. For example, the "111-byte (888-bit) sha1 hash of 'foobar'" (here sha1(s) stands for a function returning the 20-byte SHA-1 digest of s):
x := "foobar"
buf := make([]byte, 111) // 888 bits
copy(buf[0:20], sha1(x)) // 160 bits of sha1(x)
copy(buf[20:40], sha1(x + "1")) // 160 bits of sha1(x + "1")
copy(buf[40:60], sha1(x + "2")) // 160 bits of sha1(x + "2")
copy(buf[60:80], sha1(x + "3")) // 160 bits of sha1(x + "3")
copy(buf[80:100], sha1(x + "4")) // 160 bits of sha1(x + "4")
copy(buf[100:111], sha1(x + "5")) // first 88 bits of sha1(x + "5")
@jbenet But you can only do that when you have the blob (in this case, "foobar"), so it's useless for, say, doing Kademlia XOR comparisons when looking for the content corresponding to a sha1 multihash (your original example, if I understand correctly).
That's what I mean when I say that "padding (of a multihash) should be done by the application (that receives the multihash and has to operate on multihashes only, in absence of the corresponding blob)". Sorry for not being clearer earlier.
Well usually one ends up with a hash by finding it-- it was always generated by someone with the data. And most times the hash used for DHTs and so on is a hash of some data structure (the dag, not "the file"), which needs to be found as well. I'm not sure that use case precludes the thing I mentioned.
But I'm not sure the thing I mentioned is useful in any case
I feel like a fool for arguing with you about this. I'm going to put a pin in it as "maybe I need to understand it better".
Not at all, all of this is useful discussion.
cc @eternaleye for feedback
Honestly, if you need variable-length output longer than the hash, you fundamentally don't want a hash. You want a KDF. @jbenet, your Nov 19 post was basically an attempt at constructing a KDF from scratch, which I strongly recommend against.
Once that's recognized, the sensible options for multihash collapse down to HKDF, the HMAC-based Key Derivation Function. It's efficient (does only one pass over the IKM), supports both "context" and "salt" parameters, can produce arbitrary-length output, and can be instantiated with any hash.
In addition, I consider it essential that multihash include the target length in the input to the KDF - the reason for this is that, without doing so, an attacker can silently truncate a multihash in order to permit incorrect data to pass undetected.
Even so, a strict minimum hash length is necessary in order to prevent an attacker from cutting it really short, and then just brute-forcing.
Honestly, anything that is part of the multihash syntax (and thus out-of-band) should be included in the calculation (in HKDF, as part of the context) to avoid truncation attacks, algorithm substitution attacks, etc.
Another benefit of using HKDF: It uses a staged API, in which one calls two functions:
- Extract(IKM, salt) -> PRK // Extract a fixed-size pseudorandom key from a variable-sized input keying material and a salt
- Expand(PRK, context) -> OKM // Expand the PRK into arbitrary-length output keying material, bound to a specific context
This allows the expensive part (hashing the data) to be done once, and the cheap part (generating relatively short outputs) to reuse the result of that.
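For concreteness, here's what that staged API looks like with Go's golang.org/x/crypto/hkdf package; the salt and context strings are made up for illustration:

package main

import (
	"crypto/sha256"
	"fmt"
	"io"

	"golang.org/x/crypto/hkdf"
)

func main() {
	ikm := []byte("foobar")        // input keying material: the data
	salt := []byte("example-salt") // illustrative salt

	// Extract: the expensive part; one pass over the IKM yields a fixed-size PRK.
	prk := hkdf.Extract(sha256.New, ikm, salt)

	// Expand: the cheap part; derive 111 bytes of OKM bound to a context string.
	okm := make([]byte, 111)
	if _, err := io.ReadFull(hkdf.Expand(sha256.New, prk, []byte("example-context")), okm); err != nil {
		panic(err)
	}
	fmt.Printf("prk: %x\nokm: %x\n", prk, okm)
}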
@eternaleye is there a way to do what you described with the following constraints:
- all current multihashes remain valid as they are
- functions that define how to "get more bits" use that as a default
- functions that don't define how to "get more bits" use a well-defined HKDF
Sounds to me like no, given that you think we should add lengths to the input to avoid silent truncation.
BTW silent truncation still has to run up against the expected length value. If the attacker can both modify the expected length value and truncate the digest, then they could also modify the hash digest too, so I'm not sure what's won. (In practice there may be attacks that depend on being able to modify only one byte (make the len tiny) instead of many (replace the whole digest with an attacker-chosen one).)
Maybe TRTTD here is to define a different function, "Multihash HKDF", give it a multihash code and all, which, given some other secure multihash function, constructs an HKDF. The function can embed the multihash code of the sub-function it's using in its digest. For example:
<fn-code-of-mhkdf><length-k><fn-code-of-subfn><digest-of-len-k-minus-1>
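A hedged sketch of how such a value might be packed, keeping single-byte fields as in the hex example above; the 0x90 code for "mhkdf" is invented here and is not an assigned multihash code:

// packMHKDF lays out <fn-code-of-mhkdf><length-k><fn-code-of-subfn><digest>,
// where the digest occupies k-1 bytes and the sub-function code the other one.
func packMHKDF(subfnCode uint8, digest []byte) []byte {
	const mhkdfCode = 0x90 // hypothetical code, for illustration only
	k := uint8(len(digest) + 1)
	out := []byte{mhkdfCode, k, subfnCode}
	return append(out, digest...)
}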
As for the original question, "how should a digest_size > hash function's output be handled?", it looks like the answer is to signal an error, for now.
Maybe TRTTD here is to define a different function "Multihash HKDF"
I would agree with that assessment. However, I would not recommend using the expanded output as the multihash value.
Specifically, I'd suggest the multihash value be the PRK, with the information of the underlying multihash passed in the salt.
Then, use cases which require a specific length use Expand(PRK, context = (length, use_case)) in order to derive what they need. A DHT using 256-bit IDs might use Expand(PRK, context = (256, "locator_dht_v1")), and so forth.
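For example, reusing the hkdf, sha256 and io imports from the sketch above (the context encoding here is made up; a real spec would pin it down):

// dhtID derives a 256-bit DHT identifier from a stored PRK, binding the
// derivation to the target length and a use-case label.
func dhtID(prk []byte) ([]byte, error) {
	id := make([]byte, 32) // 256 bits
	ctx := []byte("256:locator_dht_v1")
	_, err := io.ReadFull(hkdf.Expand(sha256.New, prk, ctx), id)
	return id, err
}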
Currently the "hash function's output length" is not explicitly known, so implementations can't do anything about this case (see #114 for a fix). Given this information I'd say:
- either multihashes MUST fill the digest-length bytes that exceed the hash function's output length with 0x00 bytes, and add an explicit warning,
- or make such multihashes illegal by definition,
- or explicitly allow any kind of values in the additional bytes and add an explicit warning.