proposal: hash: export a built-in hash function for comparable values
bcmills opened this issue ยท 30 comments
Background
The Go runtime contains an efficient hash function that it uses to implement built-in maps.
The runtime hash function is capable of hashing arbitrary comparable values, and usually cannot be pruned away during linking: for example, it would be very difficult for the linker to determine which hash functions are needed in a program that uses reflect.MapOf
to construct maps at run-time, or a program that stores interface{}
values in a map[interface{}]Anything
.
Occasionally, we need an efficient hash function outside the standard library, too: for example, we might want to implement other hash-based data structures, such as hash array mapped tries, concurrent tries, or prototypes for potential additions to the Go standard library (e.g. sync.Map), or write application code to compute hashes for normally-incomparable types (such as slices or maps) to be able to store them in built-in maps. (For the latter case, a built-in hash function would be a building block for a larger application-level hash function.)
Typical workarounds include:
- pushing computation of hashes to application code
http://godoc.org/github.com/apmckinlay/gsuneido/util/hmap#Key
http://godoc.org/github.com/hit9/htree#Item
http://godoc.org/github.com/lleo/go-hamt-key#Key
http://godoc.org/github.com/Workiva/go-datastructures/trie/dtrie#New - restricting application code to strings, byte-slices, integers, or pointers:
http://godoc.org/github.com/Workiva/go-datastructures/trie/ctrie#Ctrie.Insert
http://godoc.org/github.com/lalonde/hamt#Key
https://github.com/crawshaw/readmap/blob/master/readmap.go
http://godoc.org/github.com/orcaman/concurrent-map#ConcurrentMap.Set
http://godoc.org/github.com/OneOfOne/cmap#CMap.Set - or, occasionally, using
reflect
orunsafe
to reimplement arbitrary hashing in application code:
http://godoc.org/github.com/dlsniper/anyhash
http://godoc.org/github.com/mitchellh/hashstructure
http://godoc.org/github.com/cnf/structhash
I believe that the toil involved in implementing these workarounds discourages experimentation with more efficient data structures in Go, and needlessly complicates application code.
It would be reckless for us to expose a stable hash function in the standard library. However, it seems like it would be useful to expose a per-process hash function, randomly salted at each invocation of the process, in order to facilitate the use of custom data structures in user code.
Proposal
I propose that we add the following function to the hash
package:
// Local returns a hash code for a comparable argument using a hash function
// that is local to the current invocation of the program.
//
// Two values that compare equal using the == operator (and values that are
// unequal but have the same type and representation in memory, such as
// floating-point NaNs with the same bitwise representation) will have the same
// hash code. Values that compare unequal will have a high probability of
// returning different hash codes.
//
// Local panics if the argument is not comparable.
//
// Each call to Local with the same value will return the same result within the
// lifetime of the program, but each run of the program may return different
// results from previous runs.
func Local(interface{}) uintptr
Also this would allow us to use a very efficient version of sync.Map that allows sharding by a key's hash.
It does feel weird to me though that this would go in the runtime
package. Perhaps the reflect
package instead?
I'm aware of where the function is implemented, but that doesn't always correlate to where it should be categorically located. Arguably a lot of logic in reflect is in the "runtime", but we don't move it all to the runtime
package either.
@OneOfOne, sync.Map
can technically already use the runtime's hash functions (using //go:linkname
shenanigans), so by itself it's not a very compelling use case. However, an exported built-in hash would make it easier to experiment with changes or variations.
@dsnet I don't feel strongly about which package it should go in, but reflect
seems like a bit of an awkward fit: the only type-introspection required in the proposed API is the knowledge that the value is comparable.
Previously proposed, more or less, in this golang-dev thread: https://groups.google.com/d/msg/golang-dev/vd5Nnt3Wjj0/a5XFu6d9wsIJ .
Any hash function can only be discussed in conjunction with an equality function, so the docs for your proposed function should make clear that values that are ==
in Go terms will get the same hash value, and that values that are comparable but !=
will get different hash values.
You suggest that this should only be used for comparable values, but of course that means that it can not be used for slices, which would seem to leave out many real world applications. (The current runtime hash will panic if used on a slice, as can be seen using a map[interface{}]bool
.) Are you sure you want to restrict only to comparable types?
If we do this I definitely don't think it should be in the runtime package. That would suggest that it is specifically the hash function used by things like map, and even if that is in fact what is implemented, it should not be implied.
Previously proposed, more or less, in this golang-dev thread:
Thanks for the link; an interesting read. It looks like @randall77's objection there was mainly that we would have to specify and freeze the algorithm, but here I'm suggesting the opposite: if we expose a hash function, we should randomize it so that folks won't accidentally rely on the particulars.
You suggest that this should only be used for comparable values, but of course that means that it can not be used for slices, which would seem to leave out many real world applications.
That's true, but having the built-in hash would make it much easier to define application hash functions for slices: it is likely that one could use the built-in hash function for the individual elements and write only a very small function to combine them.
func hashSlice(s []T) uintptr {
h := uintptr(0)
for _, e := range s {
h = Hash([2]uintptr{h, Hash(e)})
}
return h
}
Are you sure you want to restrict only to comparable types?
I think that's a good starting point since, as you note, a hash function must be paired with an equality function. The language spec already defines an equality function, so piggybacking on that minimizes the amount of new information one would need to learn in order to use the function.
If we do this I definitely don't think it should be in the runtime package.
That's fair. Any suggestions for alternatives? It could plausibly live in hash
(or a subpackage thereof), reflect
(as suggested by @dsnet), or perhaps a subpackage of container
.
Any hash function can only be discussed in conjunction with an equality function, so the docs for your proposed function should make clear that values that are == in Go terms will get the same hash value, and that values that are comparable but != will get different hash values.
Updated, with an explicit caveat for NaN values (to avoid the confusion of #20660).
Given that the standard library already has a package named "hash", I would vote for that.
You suggest that this should only be used for comparable values, but of course that means that it can not be used for slices, which would seem to leave out many real world applications.
Comparability is well defined in Go. If we expand this, would it still perform a shallow hash of pointers or would it hash the sub-value pointed to by the pointer?
If we expand this Hash
to operate on incomparable types, ignoring hash collisions, would Hash(x) == Hash(y)
be equivalent to reflect.DeepEqual(x, y)
? If the answer is no, then reflect
would not be a good home since these two don't seem to be consistent with each other.
Currently, the ==
operator is not identical to reflect.DeepEqual
in the cases of pointers.
The runtime hash function is free to use different algorithms on a per architecture and even feature set basis (AES vs non AES). It think it would be nice to take this into account for the proposed std lib hash function too such that we can optimize the hash function used for each architecture and feature set independently if the proposal is accepted. Therefore, apart from only having reproducible results within a process we could clarify that there should not be the expectation that the algorithm and "quality" of the hash is the same between different releases/archs/cpus.
Another discussion could be if we want to extend the compiler to optimize this specific std lib hash function e.g. to a specialized call if the type is known during compile time.
If [
Hash
is not consistent withreflect.DeepEqual
], then reflect would not be a good home
Agreed. The Hash
function I have in mind (admittedly, thinking about experimental variations on sync.Map
) would be consistent with ==
except for irreflexive values (e.g. NaNs), which implies that it cannot also be consistent with reflect.DeepEqual
.
A reflect.DeepHash
might be useful independently, c.f.:
http://godoc.org/github.com/davegardnerisme/deephash
http://godoc.org/k8s.io/kubernetes/pkg/util/hash
โฆbut that's not the use-case I intended to address here.
Updated the proposal to name the function hash.Local
, since the consensus seems to be that neither reflect
nor runtime
is a good fit.
The first (CS related) search result for local hash is https://en.wikipedia.org/wiki/Locality-sensitive_hashing so that may not be the most appropriate name.
Some observations. I'm still not sure where I stand on this, on balance.
Another discussion could be if we want to extend the compiler to optimize this specific std lib hash function e.g. to a specialized call if the type is known during compile time.
Without this, the calls will allocate+copy, and as a result probably be prohibitively expensive in the contexts it is most likely to be used. With compiler optimization, this ends up seeming a bit like a language change.
Another alternative is to expose hash functions for basic types (ints of various widths, strings). These are the core building blocks of the rest, which can then be composed as desired, particularly if the hash function accepts a seed (more on that below).
An alternative is to expose these via the reflect package. For example, hash could be a method on reflect.Value. Or reflect.Type could return a hash function that can be invoked with values of its type.
A good concrete test case might be how e.g. [1000]uint64 gets hashed. Passed as an interface{}, this will allocate and copy 64k. If we expose a suite of hash functions, the caller either needs to loop 1000 times (slow) or we need to provide some variable length memhash, which is inherently unsafe. If we expose only via reflect, then we add the readability costs of reflection, we add the high general costs of generating a reflect.Value or the cost of an indirect function call (reflect.Type), and we tie the tool chain's hands a bit (since IIRC it can do more magic when reflect is not in use). The only reasonable choice appears to be magic compiler support, since the compiler can ~safely do unsafe things.
As an aside, an interesting test of a generics proposal is whether and how it would let you write a generic hash function (attn: @ianlancetaylor).
One other observation: The runtime hash functions accept a seed. They do this, to quote @randall77, because in the case of maps:
You have to take a seed to prevent adversaries from engineering a collision. Without the seed, the hash computation is deterministic and so an adversary can just run lots of keys through that hash and find a bunch that all hash to the same value.
I assume that a prime use case for exposing hash functions is allow people to write their own maps. We should at least afford them the opportunity to write secure implementations. That is, any exposed hash function should accept a seed, just like the runtime's do.
And as observed above, accepting a seed allows the programmer to compose hash calls as necessary.
Another discussion could be if we want to extend the compiler to optimize this specific std lib hash function e.g. to a specialized call if the type is known during compile time.
Without this, the calls will allocate+copy, and as a result probably be prohibitively expensive
Why would the calls allocate? With sync.Map
I found that the compiler's escape analysis is pretty reliable: the built-in hash function should not allow the values passed to it to escape, so any allocations for the interface{}
value to pass to it should occur (inexpensively) on the stack.
Mid-stack inlining (#19348) should omit that allocation entirely if the value is not already an interface type, since the top-level hash function will dispatch based on a constant type tag.
A good concrete test case might be how e.g. [1000]uint64 gets hashed. Passed as an interface{}, this will allocate and copy 64k.
Ahh, now I see what you're getting at. But that should still be a general optimization ("don't copy interface payloads when they are treated as const and do not escape"), not something specific to the hash function. (It's the same optimization as #18822, but made much easier because the call does not occur through an interface.)
Without this, the calls will allocate+copy, and as a result probably be prohibitively expensive in the contexts it is most likely to be used. With compiler optimization, this ends up seeming a bit like a language change.
I agree that the copy is a problem. The function could instead be defined to take a pointer to the value to be hashed so that it doesn't have to be copied.
If we do that, I don't think allocation is a problem either, since escape analysis should already realize that we're not escaping the pointer in the interface.
Arguably, the compiler could avoid the copy even if we didn't pass a pointer if it could prove that neither the hash function nor the caller mutates the value, but I'd be fine with passing a pointer. :)
That is, any exposed hash function should accept a seed, just like the runtime's do.
That's part of the proposal already, I think? I proposed that each invocation of the program should compute a process-local seed that the hash function uses implicitly; that way users won't have to figure out how to pick an to pick an appropriate seeding algorithm, and programs will have at least a basic measure of collision-resistance by default.
I proposed that each invocation of the program should compute a process-local seed that the hash function uses implicitly
I think @josharian is talking about the per-invocation seed. For example, every map in a process has a different seed, so even if someone manages to create a collision in one map, it won't carry over to another. This is in addition to the per-process seed.
I'm super-skeptical about exposing any hash function. In the case of the current map implementation, if we did ever want to do things like moving pointers (which would influence the hashes) we know where the maps are and where the hash values are written, so we could rehash. Exposing a new hash function limits our ability to make such changes, with no clear benefit. Existing packages are using other hash functions just fine without our help.
that should still be a general optimization ("don't copy interface payloads when they are treated as const and do not escape")
The runtime hash functions currently make indirect function calls, including to routines written in assembly. I'm very skeptical that escape analysis is going to plumb those depths without explicitly teaching the compiler about these special cases. Mid-stack inlining doesn't help, because the value eventually escapes.
The function could instead be defined to take a pointer to the value to be hashed so that it doesn't have to be copied.
This seems more promising, although it's a bit of a footgun: People have to remember to pass a pointer to the value to be hashed, even when that value is itself already a pointer (that is, a **T instead of a *T, or a ***T instead of a **T). I see this going wrong on a regular basis. And there's no clear way that we can help catch that mistake at compile time or at run time, at least not without a language change.
I think @josharian is talking about the per-invocation seed.
I was indeed, although it now occurs to me that that can be simulated by just hashing a struct containing the value you care about and a seed. So then the question is whether it is important/useful enough to impose it in the API signature.
if we did ever want to do things like moving pointers (which would influence the hashes) we know where the maps are and where the hash values are written, so we could rehash.
Exposing a new hash function limits our ability to make such changes, with no clear benefit.
Agreed.
For what it's worth, the thing I have wanted over and over is access to runtime.strhash. (I tend to modify my tool chain to export it, but that's obviously not something we want people doing in general.) Exposing it appropriately would not inhibit moving pointers, nor would exporting the runtime's int- or float-hashing routines which I have also wanted, although less frequently. I think this wouldn't satisfy @bcmills's original goals, though.
Exposing a new hash function limits our ability to make such changes [as moving pointers], with no clear benefit.
Doesn't reflect.Value.Pointer already limit our ability to make changes that affect object identity?
Existing packages are using other hash functions just fine without our help.
Well, sort of. I think this proposal is a lot like #15292 (generic programming), in that the primary impact of the status quo is to reduce the diversity of packages that people are willing to provide, support, and/or experiment with. The difficulty of writing hash-based data structures in user code pushes people toward less efficient workarounds (such as serializing keys to strings using encoding/binary
, encoding/json
, or another such package) or one-off solutions (such as accepting only string
keys or only int
keys).
I haven't seen anyone mention this so I'll throw in this use case as well: Disk-backed data structures. From what I've seen here, if you have a memory mapped data structure relying on hashes you'd be unable to take advantage of a hash function that doesn't allow for a seed to be passed in. Even then, if the implementation is opaque it'd be unwise as an implementer to rely on it not changing between versions unless it's explicitly called out as static. I saw @bcmills mention that @randall77 had rejected the idea in the golang-dev thread because of the specification needed, but that would be required in this case.
To make things more concrete: suppose you're building a database system that is writing a log to a mmap'd file and generating hash values to verify data. If the hash function changes on different architectures and releases, it would not work at all.
if we expose a hash function, we should randomize it so that folks won't accidentally rely on the particulars.
Please consider an API that lets callers rerandomize that. Something like func New() Hasher
.
E.g. a good DoS-safe map
implementation would randomize differently for different variables, and even rerandomize when growing.
If you let a long-running process run with a singular randomization, that lets attackers probe it and discover the logic behind the randomization.
@ScottMansfield Persistent data structures sound like a use case that should specify their own hashing, e.g. say "xxhash
of a byte sequence consisting of ..."; the hash function proposed here is meant for very different use.
It seems like you would need to separate "create a hash function (pick a random seed)" from "start hashing a particular value with that function". If you do that, and if you are sensitive to allocation (not allocating), then you might have something like:
package somehash
type HashFunction struct { ... }
func New() HashFunction
type Hasher struct { ... }
func (f HashFunction) Start() Hasher
func (h *Hasher) Int(x int)
func (h *Hasher) Bytes(x []byte)
func (h *Hasher) String(s string)
func (h *Hasher) Sum() uint64
That seems fine, but it doesn't seem like it needs to be in the standard library. Why not experiment outside the standard library with something like this first?
I think the path forward here would be to experiment outside the standard library, as mentioned in the previous comment. That would help us gather some evidence about how well it works, how much it's needed and so on. Closing until we have that evidence.
You can add to the list of "typical workarounds" requirement to calculate the hash, like in https://github.com/larytet/mcachego/blob/master/hashtable/hashtable.go#L141