Tracking issue for HashMap::raw_entry
sfackler opened this issue · 47 comments
Added in #54043.
As of 6ecad33 / 2019-01-09, this feature covers:
impl<K, V, S> HashMap<K, V, S>
where K: Eq + Hash,
S: BuildHasher
{
pub fn raw_entry(&self) -> RawEntryBuilder<K, V, S> {…}
pub fn raw_entry_mut(&mut self) -> RawEntryBuilderMut<K, V, S> {…}
}
pub struct RawEntryBuilder<'a, K: 'a, V: 'a, S: 'a> {…} // Methods return Option<(&'a K, &'a V)>
pub struct RawEntryBuilderMut<'a, K: 'a, V: 'a, S: 'a> {…} // Methods return RawEntryMut<'a, K, V, S>
pub enum RawEntryMut<'a, K: 'a, V: 'a, S: 'a> {
Occupied(RawOccupiedEntryMut<'a, K, V>),
Vacant(RawVacantEntryMut<'a, K, V, S>),
}
pub struct RawOccupiedEntryMut<'a, K: 'a, V: 'a> {…}
pub struct RawVacantEntryMut<'a, K: 'a, V: 'a, S: 'a> {…}
… as well as Debug
impls for each 5 new types, and their inherent methods.
What is the motivation for having separate from_hash
and search_bucket
methods? It seems that the only difference is whether the hash value is checked before calling is_match
. However if the table does not store full hashes (i.e. hashbrown
) then there is no difference between these methods.
Could we consider merging these methods into a single one? Or is there some use case where the difference in behavior is useful?
I am also extremely confused by this distinction, as my original designs didn't include them (I think?) and the documentation that was written is very unclear.
The reason I added search_bucket
was because I wanted to be able to delete a random element from a HashMap in O(1) time, without storing an extra copy of all the keys. Basically, instead of doing something like this:
let key = map.iter().nth(rand() % map.len()).0.clone();
map.remove(&key);
I wanted to just be able to pick a random "bucket" and then get an entry/raw entry to the first element in it if any:
loop {
if let Occupied(o) = map.raw_entry_mut().search_bucket(rand(), || true) {
o.remove();
break;
}
}
(the probabilities aren't uniform in the second version, but close enough for my purposes)
I continue to not want to support the "random deletion" usecase in std's HashMap. You really, really, really, should be using a linked hashmap or otherwise ordered map for that.
I have removed this method in the hashbrown PR (#56241). Your code snippet for random deletion won't work with hashbrown anyways since it always checks the hash as part of the search process.
I can avoid unnecessary clones inherent to the original entry
API which is nice. But unless I'm mistaken, the current raw_entry
API seems to hash the keys twice in this simple use case:
#![feature(hash_raw_entry)]
use std::collections::HashMap;
let mut map = HashMap::new();
map.raw_entry_mut()
.from_key("poneyland")
.or_insert("poneyland", 3);
Currently I use the following function to hash once and automatically provide an owned key if necessary (somewhat similar to what was discussed in rust-lang/rfcs#1769):
use std::borrow::Borrow;
use std::collections::hash_map::RawEntryMut;
use std::hash::{BuildHasher, Hash, Hasher};
fn get_mut_or_insert_with<'a, K, V, Q, F>(
map: &'a mut HashMap<K, V>,
key: &Q,
default: F,
) -> &'a mut V
where
K: Eq + Hash + Borrow<Q>,
Q: Eq + Hash + ToOwned<Owned = K>,
F: FnOnce() -> V,
{
let mut hasher = map.hasher().build_hasher();
key.hash(&mut hasher);
let hash = hasher.finish();
match map.raw_entry_mut().from_key_hashed_nocheck(hash, key) {
RawEntryMut::Occupied(entry) => entry.into_mut(),
RawEntryMut::Vacant(entry) => {
entry
.insert_hashed_nocheck(hash, key.to_owned(), default())
.1
}
}
}
Given k1
and k2
with the same type K
such that hash(k1) != hash(k2)
, is there a use-case for calling RawEntryBuilderMut::from_key_hashed_nocheck
with hash(k1), &k1
and then inserting with RawVacantEntry::or_insert
using k2
?
If there isn't, why not saving the hash in RawVacantEntryMut
and using it inside RawVacantEntryMut::insert
? It would even be possible to assert in debug builds that the owned key has indeed the same hash as the borrowed key used to lookup the entry.
I'm not yet very familiar with this API, but what @gdouezangrard suggested seems like a great idea to me. What even happens currently if the two hashes don't match, is the key-value pair then inserted into the wrong bucket? It's not clear to me from (quickly) reading the source code.
I submitted rust-lang/hashbrown#54 to support using a K
that doesn't implement Hash
via the raw entry API. See rust-lang/hashbrown#44 for the original motivation. Now that hashbrown is merged into std
, could we expose this functionality on the std::collections::hash_map
types as well?
If so, I'd be happy to submit a PR!
This is a really great API, it's also what keeps crates (hashlink
for example) using hashbrown
instead of the stdlib hash map -- since hashbrown
exposes this.
What could be next steps here towards stabilization?
Just gonna add another ping here -- what's blocking this right now?
I see a few things that need to be resolved:
- We need proper documentation for the API functions, including examples.
- hashbrown has a few extensions to RawEntry, such as https://docs.rs/hashbrown/0.9.1/hashbrown/hash_map/enum.RawEntryMut.html#method.and_replace_entry_with. We should decide whether to port these to std as well.
- The name feels a bit wierd since it isn't clear what "raw" means. But I don't have any better ideas.
I would recommend prototyping in the hashbrown crate first, which can then be ported back in the the std HashMap.
I find raw_entry
and raw_entry_mut
methods unnecessary - unlike entry
method, they don't take any parameters, they just provide access to methods that could as well be in HashMap
itself. I think I would consider getting rid of those and putting raw entry APIs directly in HashMap
. .raw_entry().from_key(...)
is also unnecessary, unless I'm missing something it's identical to already stabilized .get_key_value(...)
.
I also would like to point out that RawVacantEntryMut
doesn't really do much other than providing an API that allows insertion which provides a reference to inserted key and value. This structure doesn't store anything other than a mutable reference to a hash map. This particular API can be used to create unrelated keys, like in this example.
#![feature(hash_raw_entry)]
use std::collections::HashMap;
fn main() {
let mut map = HashMap::new();
map.raw_entry_mut().from_key(&42).or_insert(1, 2);
println!("{}", map[&1]);
}
This is a bit like calling insert
after determining an entry is vacant. I think raw_entry_mut
APIs could return Option
s just like raw_entry
APIs.
#![feature(hash_raw_entry)]
use std::collections::hash_map::{HashMap, RawEntryMut};
fn main() {
let mut map = HashMap::new();
if let RawEntryMut::Vacant(_) = map.raw_entry_mut().from_key(&42) {
map.insert(1, 2);
}
println!("{}", map[&1]);
}
I think raw entry API is useful, but I don't think its API should be conflated with entry API.
As discussed here: rust-lang/hashbrown#232
Allowing the user to specify the hashed value with the contract that it is generated in the same way that the map computes that hash has two drawbacks:
- It locks in the implementation of the Hashmap to never changing how that code is invoked. In particular this prohibits hashmap from ever using specialization. This is leaving significant performance gains on the table for types with fixed lengths and short strings. (This makes
raw_entry
a non-zero-cost-abstraction because the cost is incurred even if the feature is not used.) - It creates an opportunity for a bugs in applications that accidently do something different. If for example an application takes advantage of this to create a memoized hash or similar, and their calculation is different in some cases the results will be unexpected and lack a clear error message.
If the feature of a user specified hash is needed, it may be useful to instead provide a method on the raw entry to hash a key. That way the hashmap can implement this however it sees fit and the application code is less error prone because there is an unambiguous way to obtain the hash value if it is not known in advance.
There is a very simple and common use case that, as far as I can tell, is not currently possible with the standard HashMap
, which covers most of the times I have wanted raw_entry_mut
. Is it possible to support this without the full generality and complexity of the raw entries API?
Description of the use case:
Imagine I have HashMap<Vec<u8>, usize>
. What I want to do is:
(1) check if some key exists,
(2) if so, get the corresponding value; if not, write a new value.
There are two options I can see, each of which has unnecessary overhead.
First option:
let x = my_map.entry(key.clone().or_insert(42);
Second option:
let x = if let Some(x) = my_map.get(&key) {
*x
} else {
my_map.insert(key.clone(), 42);
42
};
The defect of the first option is that it requires a clone of the key even when no insertion is necessary. The defect of the second option is that it requires two hash computations and lookups when an insertion is necessary.
I find myself wanting to do this practically every time I use hash maps, so it's possible this is an already supported basic API and I'm just missing it in the docs.
I find myself wanting to do this practically every time I use hash maps, so it's possible this is an already supported basic API and I'm just missing it in the docs.
No, this requires raw_entry_mut.
Thanks for confirming. Has the idea of adding an API for this specific use case (e.g., get_or_clone_key_and_insert_with
, which would of course only be available when K: Clone
) been discussed? I feel like it would be much simpler than nailing down the fully general raw_entry_mut
.
Yes, there was a great deal of discussion about this in the past.
Note that K: Clone
isn't quite enough for common str
/String
or [T]
/Vec<T>
cases. It would need to be ToOwned
, probably at the very least.
Do you know where to find the past discussion? I'm interested in learning whether/why that idea was rejected. After that I will stop spamming this issue; sorry!
- rust-lang/rfcs#1769
- https://internals.rust-lang.org/t/pre-rfc-abandonning-morals-in-the-name-of-performance-the-raw-entry-api/7043
- https://internals.rust-lang.org/t/head-desking-on-entry-api-4-0/2156
- rust-lang/rfcs#1533
- rust-lang/rfcs#1203
Actually, there are way more discussions that this doesn't bring up, but this is a start.
IMO raw_entry
is a poor API because it is trying to do too much at once. It is trying to serve two different types of use cases:
- High level cases where you just want to avoid having to pass ownership of the key into
entry
. These might be better served with anentry_ref
API that takes aT: ToOwned
. Or perhaps allowing the user to provide the key separately when inserting (if it doesn't match the key used for the lookup then that's a user error similar to a misbehavingEq
impl). - Low level cases where you are doing fancy things like hash memoization, value interning or dealing with values that don't implement
Eq
/Hash
because they refer to data outside the HashMap (e.g.IndexMap
). These might be better served by using a lower-level hash table API such as hashbrown'sRawTable
which has methods like this:
// - Just a single T instead of (K, V).
// - No bounds on T.
// - Hashing and equality checking is completely user-controlled.
// - A hasher is provided by the caller for functions that might cause a table resize.
pub fn get_mut(
&mut self,
hash: u64,
eq: impl FnMut(&T) -> bool
) -> Option<&mut T>;
pub fn insert_entry(
&mut self,
hash: u64,
value: T,
hasher: impl Fn(&T) -> u64
) -> &mut T;
Reading the past discussions posted by Thom, I didn't find where entry_ref
, entry_or_clone
and similar ideas were clearly rejected -- just that discussion on them has fizzled out because raw_entry
will subsume those use cases.
It has been five years - perhaps it's time to land those so people can start using HashMaps now without performance overhead, rather than waiting for raw_entry_mut
to maybe someday be stabilized.
Low level cases where you are doing fancy things like hash memoization, value interning or dealing with values that don't implement
Eq
/Hash
because they refer to data outside the HashMap (e.g.IndexMap
). These might be better served by using a lower-level hash table API such as hashbrown'sRawTable
which has methods like this:
This applies to BTreeMap
and BinaryHeap
too. Maybe we could have a Raw*
version of each collection?
We reviewed this in today's @rust-lang/libs-api meeting and agreed with the summary in #56167 (comment).
We'd like to see an implementation of entry_ref
which allows an Entry
to be created from just a reference to a key, which would be held in a Cow
internally and upgraded to an owned key when necessary for an insertion.
We feel that any more advanced use cases are better off using hashbrown
's RawTable
API directly which is much more flexible than the HashMap
API.
@rfcbot fcp close
Team member @Amanieu has proposed to close this. The next step is review by the rest of the tagged team members:
Concerns:
- keep unstable for now (#56167 (comment))
Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!
See this document for info about what commands tagged team members can give me.
I'm fine with the conclusion but I would like we have the alternative in nightly before remove raw_entry()
from nightly we loose nothing wait for entry_ref
and several code rely on this feature, I'm sure people are fine remove raw_entry from their code but they will want the alternative.
I agree with #56167 (comment), don't think there is any rush to delete this. Keeping it in front of people will help people form an opinion of how to serve each of the two use cases better.
@rfcbot concern keep unstable for now
I created an initial implementation of the entry_ref
API (rust-lang/hashbrown#301) which is available in version 0.12 of hashbrown in case people want to try it out.
@rfcbot cancel
We've discussed this in the libs api meeting. We'll keep this around as unstable until we have something better.
One thing I wanted to record for consideration in whatever this becomes:
It's certainly true that for unknown generic types, one cannot trust the Hash
impl, and thus can't trust HashMap
for arbitrary generic types when it comes to soundness.
But it might be nice to be able to trust, say, a HashMap<usize>
(type and hasher both from the standard library) that someone gave you in your unsafe code actually only has distinct items in it. And that would imply that the various unchecked raw APIs would need to be unsafe
or a different type.
For example, imagine a fn deref_mut_multiple<T>(slice: &mut [T], indexes: HashSet<usize>) -> impl Iterator<Item = &mut T>
. Without the raw
methods, that can be sound without extra checks, just mapping over get_unchecked. But the existence of the raw
methods makes that unsound, and in a way that can't be solved by something like TrustedHash
/TrustedEq
marker unsafe trait
s.
I’ve replaced a use of entry
with raw_entry_mut
in order to avoid memory allocation for keys already present in a HashMap
. To simplify, keys in the map are Cow<'a, [u8]>
and the input is &'b [u8]
. In some cases lifetimes match so Cow::Borrowed
can be used, but in other cases Cow::Owned
is needed.
entry_ref
is not a good fit since it relies on From
impls to convert to an owned key, so I’d need two wrapper types for the two cases. (Here "owned" is Cow<[u8]>
as opposed to &[u8]
, not Cow::Owned
.) raw_entry_mut
provides more flexibility on how to do that conversion.
RawTable
provides all of the flexibility but much lower-level, so switching would be a far-reaching change. Even in if that still makes sense (removing the lifetime parameter in maps and associated owning-ref
-like hacks may be worth using indices into side storage instead of Cow::Borrowed
), RawTable
has some rough edges: no documentation, seemingly-unnecessarily unsafe
iterators, etc.
Perhaps a middle ground would be to change entry_ref
to remove its use of From
, and instead have APIs where the owned key is passed explicitly like in raw_entry_mut
?
Is there any chance to get a RawEntry
API for BTreeMap
s as well, also as an unstable feature for the moment until a final decision is made?
There was a mention of BTreeMap
s in Abandonning Morals In The Name Of Performance: The Raw Entry API but I could not find any followups, it seems to have been forgotten in the discussions
The API for BTreeMap is much the same except everything about hashing is removed. It’s possible it would also just support prev()/next() on entries, in which case a Real immutable entry type would be necessary. At that point we may also wish to swap the API naming polarity back to the typical raw_entry + raw_entry_mut scheme.
cc @Gankra
For anyone reading this RFC while exploring a successor/better alternative to hash_raw_entry
, please keep in mind a very common pattern that wasn't cleanly covered by the prototype raw entry api: cleanly bubbling back whether or not a key was entered into the map, should that information be needed down the line.
The (un)released .raw_entry[_mut]().or_insert_with(|| ...)
method returned a tuple of (&K, &V)
but I think returning (&K, &V, bool)
would have reduced boilerplate code without incurring too much of a burden (since one of &K
or &V
in the tuple is oftentimes already masked because only the other is required), in lieu of the following:
let mut new_pool: bool = false;
let (_, old_status) = cached_status.raw_entry_mut()
.from_key(pool)
.or_insert_with(|| { new_pool = true; (pool.to_owned(), *new_status) });
if new_pool || old_status != new_status {
eprintln!("{}: {:?}", pool, status);
*old_status = *status;
}
which could have just become
let (_, old_status, new_pool) = cached_status.raw_entry_mut()
.from_key(pool)
.or_insert_with(|| (pool.to_owned(), *new_status));
if new_pool || old_status != new_status {
eprintln!("{}: {:?}", pool, status);
*old_status = *status;
}
The "regular" HashMap::insert()
returned an option by means of which it was possible to deduce if the key already existed, while the unstablized HashMap::try_insert()
returns a Result in an error state if the key already existed. (The "regular" non-raw entry api suffers from a similar flaw.)
Is there a reason for the HashMap::raw_entry
and HashMap::raw_entry_mut
methods to require S: BuildHasher
? Perhaps this bound can be moved to RawEntryBuilder::from_key
and RawEntryBuilderMut::from_key
. This would make it possible to use a hash map without an intrinsic hasher builder (i.e. S = ()
), accessing entries purely through the raw entry API.
Is there a reason to overload this into std::collections::HashMap? If instead we had a std::collections::RawHashMap
that has a dedicated API, does that simplify the need to merge cleanly in between existing non-raw methods?
I think that could get rid of the (at least to me) confusingly named raw_entry_mut
and raw_entry
which don't actually return entries like you might expect with sibling methods named entry
but instead return views to access entries in a raw way (raw_entries
/ raw_entries_mut
might be better names if bikeshedding). I'm wondering if instead you have a RawHashMap
that has the methods of raw_entry
/raw_entry_mut
exposed as normal top-level methods. That way RawHashMap
would be a clean break in that it wouldn't even need an S
as the API contract there is that the hash is always managed externally (i.e. the HashMap
interface can remain unchanged).
In fact, I don't think you'd even need a RawHashMap<K, V>
- it could just be RawHashTable<V>
or RawHashSet<V>
where the key is the hash and V either implements PartialEq OR there's a function to determine whether two V
instances are equivalent for a colliding hash. The std HashMap
could then just be a wrapper on top of RawHashTable<(K, V)>
(although the tuple is actually probably a named struct that implements PartialEq
evaluating just K
). RawHashTable
could also then be used for both HashMap
and HashSet
which is nice in terms of reducing maintenance (i.e. don't have to reimplement the "raw" APIs for HashSet
which is currently missing from nightly).
Thoughts? I'm curious who's driving this / who I'd need to be coordinate with if there's interest in this alternative path.
Is there a reason to overload this into
std::collections::HashMap
?
Many people and APIs already use HashMap
/HashSet
, so for this to be useful it needs to be available on HashMap
/HashSet
. Otherwise people could just switch to a more flexible solution, i.e. hashbrown::HashMap
/hashbrown::HashTable
.
but instead return views to access entries in a raw way (raw_entries / raw_entries_mut might be better names if bikeshedding)
+1 on this. To be more intuitive I would also rename the from_
methods to find_by_
since we're calling them on a RawEntries
, not a single RawEntryBuilder
.
I'm wondering if instead you have a
RawHashMap
IMO this would be useful only if it was accessible from an instance of HashMap
, but at that point it seems another renaming of raw_entry
rather than an entirely separate type.
where the key is the hash and V either implements PartialEq OR there's a function to determine whether two V instances are equivalent for a colliding hash.
Definitely need a way to use an external function since many usecases rely on that.
The std
HashMap
could then just be a wrapper on top ofRawHashTable<(K, V)>
Currently HashMap
is a wrapper on top of hashbrown::HashMap
and forwards most calls to it. Making it a wrapper over another wrapper of hashbrown::HashTable
would likely increase maintenance costs since all the HashMap
and HashSet
methods would now have to be reimplemented on top of it.
What I'm suggesting would be that HashMap and HashSet would just expose a raw/raw_mut to return the underlying RawTable. As is, anyone interested in using the raw interface still has to play this dance pretending like there's a hasher which makes things confusing to maintain (in case someone in your codebase doesnt realize your only using the raw interface and decides to store keys directly).
Doing a lift and shift doesn't honestly seem like that much work and I don't see why long term maintenance work would increase with this approach.
I would like to deprecate and remove raw_entry
in favor of the low-level HashTable
API that has been added to hashbrown. I am not suggesting adding HashTable
to the standard library: people who need it should just use it directly from hashbrown.
HashTable
provides a cleaner approach since it is an entirely separate type and doesn't need to deal with the restrictions of HashMap
such as the need for separate key & value types and the need for an explicit hasher.
Note that HashTable
is not convertible to/from HashMap
(via an as_raw
method). However I don't believe there are any use cases where this matters and which could not just use HashTable
directly instead.
@rfcbot fcp close
Team member @Amanieu has proposed to close this. The next step is review by the rest of the tagged team members:
No concerns currently listed.
Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!
See this document for info about what commands tagged team members can give me.
Doesn't this decision mean binary bloat because std::hashmap will exist because some dependency somewhere uses it and a duplicate implementation would be pulled in through hashbrown for those that need HashTable? Is there a reason not to standardize HashTable?
EDIT: To be clear, I'm not against this, but I'd like to understand if there's an interest in standardizing HashTable
I haven't been following this that closely, so for the benefit of those not paying a lot of attention, let's say that a user wants to do the raw_entry thing with an expensive key that they only want to construct if the map doesn't have it (ie, https://stackoverflow.com/questions/51542024/how-do-i-use-the-entry-api-with-an-expensive-key-that-is-only-constructed-if-the).
Is the answer of using HashTable
just....don't use HashMap and use HashTable if you want to do that? Just to be clear, I haven't been following very much, so I appreciate anyone's help walking me through what the solutiong that amanieu is proposing is.
Is the answer of using
HashTable
just....don't use HashMap and use HashTable if you want to do that?
hashbrown
's answer to that is the entry_ref
method, which works like entry
except it constructs the key only when calling insert
with a vacant entry. For now I don't see proposals to add a similar API to the stdlib though.
HashTable
would instead be the solution for those that need an extremely flexible API that lets them manually supply the hash. In raw_entry
terms this would be the equivalent of RawEntryBuilder{Mut}::from_hash
and RawEntryBuilder{Mut}::from_key_hashed_nocheck
.
🔔 This is now entering its final comment period, as per the review above. 🔔
Checking my box here, and hoping that HashTable and some portion of RawTable (enough for dashmap
for instance) can be stabilized in the future.
The final comment period, with a disposition to close, as per the review above, is now complete.
As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.