crossbeam-rs/rfcs

Concurrent hash table

Opened this issue · 43 comments

This is just a brainstorming issue for concurrent hash tables.

@Amanieu, do you think we could make hashbrown concurrent by having a lock per bucket? When resizing the table we'd probably have to lock all buckets. That sounds like a rather straightforward approach for building a totally generic concurrent hash table without much fuss.

I'm sure we could devise a faster hash table for more specialized use cases. Apart from a fully generic HashMap<K, V>, we'd probably want to build a special HashMap<K: Copy, V> too that can avoid locks in some places.

Thoughts? What do you think would be the best way forward here? Ideally something not overly ambitious that at least gets us started. :)

cc servo/servo#22334
cc @SimonSapin

Here's a design which keeps all values in the table itself rather than through a pointer.

EDIT: There is actually a flaw with this design, which could allow the same key to be inserted twice in the map.

This is a very efficient layout since it doesn't suffer from pointer indirection. In effect, it would look just like hashbrown does today. However (K, V): Clone is required at minimum to support growing the hash table.

Once an element has been added to the table, it can be marked as erased, however the value itself will only be dropped when the table is destroyed (or the table grows and the old table is GCed).

The table will use 8-bit control bytes, just like hashbrown, but with the following bit allocation:

  • 1 bit to mark the table being resized. Lookups will ignore this bit, insertions and deletions will abort the current operation and block until the table resize is complete. (This was partly inspired by the Junction hashmap.)
  • 1 bit to differentiate between empty and full entries.
  • 6 bits of the hash code to speed up lookups. For empty entries these bits are used for extra info (deleted, reserved, etc).

Lookup is wait-free: just ignore the top bit when scanning the table.

Insertion will first try to decrement the table capacity counter, and resize the table if it is already full. It will then search for an empty control byte, CAS it to "reserve" the slot, write the value to the data slot, then CAS the control byte again to "publish" the element. If at any point the resize bit is set, the operation is aborted.

Deletion is even simpler: just perform a CAS to mark the control byte as empty. If the resize bit is set, the operation is aborted.

Aborted operations will basically block until the resize has finished, and then retry.

Resizing basically goes through the entire table and sets the resize bit on every element, while cloning them to a new table. After the resize is complete, wake up any threads waiting for the resize.
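
To make the control-byte scheme above a bit more concrete, here is a rough sketch of what the per-slot operations could look like. The bit positions and the EMPTY/DELETED/RESERVED encodings are guesses for illustration only, and the probing loop, capacity counter, and resize machinery are all omitted:

use std::sync::atomic::{AtomicU8, Ordering};

const RESIZE_BIT: u8 = 0b1000_0000; // set on every control byte while a resize is in progress
const FULL_BIT: u8 = 0b0100_0000;   // distinguishes full slots from empty ones
const H2_MASK: u8 = 0b0011_1111;    // low 6 bits of the hash, stored for full slots

// "Empty" states reuse the low bits for extra info, as described above.
const EMPTY: u8 = 0;
const DELETED: u8 = 1;
const RESERVED: u8 = 2;

// Wait-free lookup check: the resize bit is simply masked off.
fn slot_matches(ctrl: u8, hash: u64) -> bool {
    let ctrl = ctrl & !RESIZE_BIT;
    (ctrl & FULL_BIT) != 0 && (ctrl & H2_MASK) == ((hash as u8) & H2_MASK)
}

// Insertion step for one candidate slot: reserve it, write the (K, V), publish it.
// Returns false if the slot was taken or a resize started; the caller then keeps
// probing or blocks until the resize finishes.
fn try_insert_slot(ctrl: &AtomicU8, hash: u64, write_kv: impl FnOnce()) -> bool {
    if ctrl
        .compare_exchange(EMPTY, RESERVED, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        return false;
    }
    write_kv(); // store the key/value into the data slot while it is reserved
    let published = FULL_BIT | ((hash as u8) & H2_MASK);
    // Publishing fails only if the resize bit was set in the meantime.
    ctrl.compare_exchange(RESERVED, published, Ordering::Release, Ordering::Relaxed)
        .is_ok()
}

// Deletion: a single CAS to a tombstone; it fails (and the caller aborts) if the
// resize bit has been set.
fn try_delete_slot(ctrl: &AtomicU8, expected: u8) -> bool {
    ctrl.compare_exchange(expected, DELETED, Ordering::AcqRel, Ordering::Relaxed)
        .is_ok()
}

fn main() {
    let ctrl = AtomicU8::new(EMPTY);
    let hash = 0xdead_beef_u64;
    assert!(try_insert_slot(&ctrl, hash, || { /* write (K, V) here */ }));
    assert!(slot_matches(ctrl.load(Ordering::Acquire), hash));
    assert!(try_delete_slot(&ctrl, ctrl.load(Ordering::Acquire)));
}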

Interesting! What about insertion when the same key already has a value in the map?

(Aside: While reading this I was wondering if we could get away with moves instead of (K, V): Clone. In other words, can the semantic transfer of ownership (when the new and old table are swapped) be decorrelated from the implementation of the move (shallow byte copy)? I think the answer is no, not while preserving Sync-correctness in the possible presence of interior mutability. If V contains a Mutex<T> for example, we could be trying to "move" it while another thread is in the middle of writing to it. Clone::clone however receives &self so it’s the impl’s responsibility to acquire that mutex to correctly clone the value while preserving Sync.)

An easy way to get rid of the Clone requirement is to just put the values in an Arc. I have some ideas about a different table design based on pointers, but haven't fully figured out all the details yet.

As for insertion when a key already exists, I forgot to mention that part. Essentially, after a successful insertion, you need to perform another scan and delete all equal elements that are "before" the newly inserted one.

If we are careful with the memory orderings then we should be able to ensure that a thread performing lookups will never see a "gap" where the item it is looking for is not found.

@Amanieu Interesting, that sounds like a pretty good design.

I wonder, would it make sense to make resizing lock-free like in Cliff Click's hash table (but maybe not in the first release of the hash table)?

Another concern I have is what if we have a HashMap<String, String>? Then resizing would create clones of all strings, which might be a little expensive. But I suppose then we should use HashMap<Arc<str>, Arc<str>> instead?

Do you have an opinion on adaptive radix trees? The good thing about this tree is that it doesn't need resizing so we could just have Art<String, String> without ever cloning strings.

Hmm actually after thinking about it for a bit, my design is broken because it allows 2 concurrent inserts of the same key to create 2 separate entries in the table. So please disregard it.

I think the only way to keep using the fast SIMD lookups from SwissTable is to use a sharded lock for inserts & deletes, where the shard is selected based on the hash. This also avoids the need for a resize bit in the control bytes.
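
For what it's worth, the sharding idea can be illustrated with a minimal sketch that uses std's HashMap and RwLock as stand-ins for the SwissTable internals; the shard count and the way the hash picks a shard are arbitrary here:

use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash, Hasher};
use std::sync::RwLock;

// Each shard is an ordinary HashMap behind its own RwLock; a key's hash picks
// the shard, so operations on different shards never contend.
struct ShardedMap<K, V> {
    build_hasher: RandomState,
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(num_shards: usize) -> Self {
        Self {
            build_hasher: RandomState::new(),
            shards: (0..num_shards).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    fn shard(&self, key: &K) -> &RwLock<HashMap<K, V>> {
        let mut hasher = self.build_hasher.build_hasher();
        key.hash(&mut hasher);
        &self.shards[hasher.finish() as usize % self.shards.len()]
    }

    fn insert(&self, key: K, value: V) -> Option<V> {
        self.shard(&key).write().unwrap().insert(key, value)
    }

    fn get(&self, key: &K) -> Option<V> {
        self.shard(key).read().unwrap().get(key).cloned()
    }
}

fn main() {
    let map = ShardedMap::new(16);
    map.insert("answer", 42);
    assert_eq!(map.get(&"answer"), Some(42));
}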

HashMap<String, String> vs. HashMap<Arc<str>, Arc<str>> is a docs concern, isn’t it?

Yes, the tradeoffs would have to be explained in the documentation, since clone is called on table growth.

Suppose we go with cloning keys & values when resizing the table. What do you think the HashMap::get() method should return?

  • An Option<EntryGuard<'a, K, V>>?
  • Or, a cloned value Option<V>?

I was thinking of returning something along the lines of an Option<&'guard (K, V)>. There is no need to copy values out unless absolutely required.

Useful comment with links: crossbeam-rs/crossbeam#41 (comment)

It might be worth looking into https://github.com/efficient/libcuckoo because it's a concurrent hash table with high performance but uses locking here and there and locks the table during resizes. Apart from the "cuckoo" part, it seems very similar to what we're aiming for here.

In general, I'm biased towards trie structures for this case since they provide very good performance overall and tend to fare better in many-writer scenarios, whereas hashmaps can have very poor scaling with many writers. The ART paper is especially compelling in this regard given the implementation simplicity.

Relativistic hash tables from the Linux kernel should be easy to port to Rust but require RCU.
Do you think Crossbeam already has all the RCU support that would be needed?

This hash table is great because it does not block readers during a table resize, so it gives stable low-latency access.

I would propose using a variant of a Jagged Array™ as the backing structure of a concurrent hash-table.

The structure is pretty simple:

use std::sync::atomic::AtomicPtr;

// Each slot points to one backing array (see below); 32 slots are plenty,
// since every additional slot doubles the total capacity.
struct JaggedArray<T> {
    data: [AtomicPtr<T>; 32],
}

Where each pointer points to an array:

  • the first array has n elements (decided either at compile-time or run-time),
  • each subsequent array doubles the number of elements of the structure.

This means data[0].size() == data[1].size() == data[2].size() / 2 == data[3].size() / 4 == ....
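
Assuming that layout (block 0 holds n slots and block i, for i >= 1, holds n << (i - 1)), a hypothetical helper for mapping a global slot index to a (block, offset) pair could look like this; it is purely illustrative:

// Hypothetical mapping from a global slot index to (block, offset) in the jagged
// array, assuming block 0 holds `n` slots and block i (i >= 1) holds n << (i - 1),
// so that each new block doubles the total capacity.
fn locate(index: usize, n: usize) -> (usize, usize) {
    if index < n {
        (0, index)
    } else {
        let block = (index / n).ilog2() as usize + 1;
        let start = n << (block - 1); // total slots held by blocks 0 .. block
        (block, index - start)
    }
}

fn main() {
    let n = 16;
    assert_eq!(locate(5, n), (0, 5));  // still in the initial block
    assert_eq!(locate(16, n), (1, 0)); // first slot of block 1
    assert_eq!(locate(40, n), (2, 8)); // blocks 0 and 1 cover the first 32 slots
    println!("ok");
}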

Note: I came up with the name, and I am unaware of any prior art. I would definitely appreciate pointers from more knowledgeable people; I doubt such a simple concept is not already used somewhere.


It is potentially feasible to grow a hash-map in a lock-free manner using this structure:

  • Doubling the capacity of the array can be done lock-free (if the underlying allocator is lock-free), although this may cause wasteful allocations: threads losing the CAS would need to de-allocate their freshly allocated array.
  • Once the underlying storage has grown, on average 50% of the items need to be moved; this can probably be done incrementally, by having each insertion check/move 2 pre-existing items... An epoch counter seems to be necessary, however, to avoid invalidation of currently loaned out references.

Or, a crazy alternative, never move:

  • limit the size of the data member to 4 pointers,
  • multiply the number of entries by 8, rather than 2, each time.

This means blocks of N, 7N, 56N, and 448N elements, for a total of 512x the initial capacity.

Advantages:

  • No epoch/reclamation juggling: references can be safely loaned out.
  • No Copy/Clone necessary: elements are never moved once inserted.

Disadvantages:

  • Size Guess: user has to guess the maximum size they will ever need.
  • Search: up to 4x slower when having to search from the maximum size (512 * N), then the previous size (64 * N), then the previous size (8 * N), and finally the initial size (N). This worst case is systematic for not-found lookups.
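
A quick runnable check of the arithmetic above, with N as an arbitrary placeholder; a not-found lookup has to probe every block, newest (largest) first:

fn main() {
    let n: usize = 1024; // placeholder for the initial capacity N
    let mut total = n;
    let mut blocks = vec![n];
    for _ in 0..3 {
        let new_total = total * 8;      // total capacity multiplies by 8 each step
        blocks.push(new_total - total); // elements added by the new block
        total = new_total;
    }
    assert_eq!(blocks, vec![n, 7 * n, 56 * n, 448 * n]);
    assert_eq!(total, 512 * n);
    for size in blocks.iter().rev() {
        println!("probe block of {} elements", size);
    }
}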

BTW: I am not sure that cloning is such a good idea when handing out references. For example, while an AtomicUsize could be cloned, a user with a reference to the "old" value can still mutate it, and the mutation is not reflected in the "new" value, which is rather surprising.

This can be solved in many ways, but memory-stable elements is probably the easiest. Wrapping them in Arc and returning Arc<(K, V)> is a solution; although it has a cost.


This isn't really a problem in practice since Atomic* and Mutex don't implement Clone. This means that you would have to manually implement Clone if you wanted to add a type with thread-safe interior mutability to a concurrent hash table (#[derive(Clone)] won't work).

In case people are not aware of quiescent state based reclamation (QSBR), here is a concurrent map that was implemented in C++ based on QSBR and it outperforms libcuckoo: https://preshing.com/20160222/a-resizable-concurrent-map/

When implementing such primitives, I would recommend looking at the state of the art in game development.


Isn't QSBR just epoch-based reclamation?

It may be necessary, however putting the onus on the user to find a good place to bump the epoch on each thread is not as ergonomic as automatically handling it for them. Unless it is impossible to come up with a design which is ergonomic, or the performance cost is too high, I'd favor the more ergonomic approach.

Thanks for the link; I think the idea of the "Reclaimed" marker, whether there or in the index table, is an excellent way of facilitating incremental migration... especially in Rust where we can perform a bitwise copy from old to new table and "forget" to call the destructor of the elements in the old table.

Isn't QSBR just epoch-based reclamation?

They are similar but not the same. On the reader's side, EBR uses something like:

EnterEpoch(); 
// read
ExitEpoch(); 

whereas QSBR identifies quiescent states by having the application report them when convenient. As you pointed out, it is not as ergonomic, but the ease of use of EBR comes at a performance cost.

Here is a good reference comparing EBR, QSBR and hazard pointers: http://www.rdrop.com/users/paulmck/RCU/hart_ipdps06.pdf
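
To make the EBR side concrete, here is a tiny example using crossbeam-epoch, where pinning plays the role of EnterEpoch() and dropping the guard plays the role of ExitEpoch(); as far as I know crossbeam has no QSBR-style API, which would instead have the application announce quiescent points itself:

use crossbeam_epoch::{self as epoch, Atomic};
use std::sync::atomic::Ordering;

fn main() {
    let slot = Atomic::new(42u32);

    {
        let guard = epoch::pin(); // EnterEpoch()
        let shared = slot.load(Ordering::Acquire, &guard);
        if let Some(value) = unsafe { shared.as_ref() } {
            println!("read {}", value);
        }
    } // guard dropped here: ExitEpoch()

    // Only this thread ever touched the slot, so we can reclaim it directly.
    unsafe { drop(slot.into_owned()) };
}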

@stjepang By the way, while lock-free/wait-free are good, they say nothing about contention.

For an extreme example, imagine 96 threads (2x24 cores, multithreaded) all concurrently trying to clone the same Arc<T>, then drop the clone, in a loop. This is going to generate a massive amount of inter-core/inter-socket traffic as each "pseudo" core attempts to acquire the cache-line for exclusive access.

Furthermore, still with Arc<T>, there is a false-sharing issue in that any thread cloning/dropping a clone of the Arc requires exclusive access to the cache-line, which negatively affects readers who already have their own clone and are simply reading the value of T.

How concerned should we be about such contention/false-sharing in the design?

Or otherwise said:

  • Should the hashmap store Arc<(K, V)> and return a clone, despite the contention?
  • Should the hashmap store a SharedPtr<(K, V)> and return a clone, using C++ style independent control-block so readers are not bothered by false-sharing?
  • Should the hashmap store something and return a &(K, V), although maintaining the reference in the presence of deletion may be challenging?

I'd say a "good enough" solution is whatever matches the performance and simplicity of libcuckoo. Let that be our baseline as it has strong academic support, fares well in all published benchmarks, and I've heard it's been very successful in industry. It's become the go-to concurrent hash table in C++.

Lookups into the hash table are a bit tricky. Returning a clone of Arc<V> is too expensive if V is something simple like an integer. Returning &V safely is going to be very difficult in concurrent settings. I'm thinking we should return either a LockGuard<V> or a clone of V. Given that libcuckoo is returning a clone of V, I'm currently leaning towards it, but need to do more research.

Libcuckoo API: http://efficient.github.io/libcuckoo/classcuckoohash__map.html

But if V is too big to be read in a single atomic instruction, making a clone of V needs some kind of locking or other synchronization so that the value keeps being valid while the clone is being made. If that’s required anyway, why not return LockGuard<V>?

@SimonSapin True, what you're saying makes total sense. Not sure why libcuckoo doesn't return a guard as well... It's possible that the only reason is they didn't want to expose the internal locking to the outside world.

Thanks for the libcuckoo code.

It seems pretty lock-heavy (using spin-locks, but still). In Rust, the absence of copy-constructor should allow optimistic reads which avoids locking.

I'm thinking we should return either a LockGuard<V> or a clone of V.

A LockGuard<V> actually comes with all the burdens of references: while the element is locked, any operation on the element blocks. It cannot be moved or deleted.

This would be a very good reason for libcuckoo to avoid it and prefer cloning.

Returning &V safely is going to be very difficult in concurrent settings.

I've been thinking about it, and the stumbling block was deletion. You can allocate Arc<V> so as to be able to move the Arc<V> around and still have a stable reference, however you need some kind of "queue" of to-be-deleted Arc<V> that may still be dereferenced... aka epoch-based/quiescent-state/hazard-pointer issues.


@SimonSapin

But if V is too big to be read in a single atomic instruction, making a clone of V needs some kind of locking or other synchronization so that the value keeps being valid while the clone is being made.

Indeed; there are potentially better ways than locking for reads, though.

If that’s required anyway, why not return LockGuard<V>?

I would really prefer NOT to return a LockGuard<V> if we can avoid it.

In libcuckoo, the maximum number of locked elements, at any time, is 2x or 3x the number of threads. With a LockGuard<V>, a single thread could keep a lock on thousands of elements. You could blame the user for their carelessness, or recognize that returning a LockGuard<V> makes for an error-prone API.

Furthermore, a LockGuard<V> forces you into locks (and/or epoch tricks). And most likely fine-grained locks. Ad-vitam æternam. By comparison, returning a clone of V leaves you free to implement any kind of synchronization mechanism.

Rust has a BIG advantage over C++ here: the absence of a Copy Constructor. It means you can copy a block of raw bytes elsewhere, then treat them as a V to clone: cloning a String can be expensive, but the raw bytes copy was only 24 bytes, which is super cheap. This matters since only the raw bytes copy is potentially contending with writes! Of course, this also applies to inserts, but not updates.

This same advantage also allows Optimistic Reads:

  • Read: (In a loop) read group epoch, if odd start again, read raw bytes, check group epoch, if changed start again, clone.
  • Insert/Update: Lock group, bump epoch (becomes odd: .fetch_or(1)), write, bump epoch (becomes even: .set(prev + 2)), unlock.
  • Delete: Lock group, mark element deleted, unlock; readers are not affected.

This is relatively simple to implement, I would say, and offers lock-free reads. In Rust, it also works with any kind of V, even non-Sync ones.

Note: you may also pick one lock/epoch per element if supporting Update is desired. Updates allow unbounded, and possibly panicking, user code to run though.
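
Here is a minimal sketch of that optimistic-read protocol for a single slot holding a Copy value. A real table would keep one epoch per group of slots and fold the deleted flag in; the orderings below are deliberately conservative (SeqCst everywhere), and the racy byte copy would need more care in production code:

use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicUsize, Ordering};
use std::sync::Mutex;

struct OptimisticSlot<V> {
    epoch: AtomicUsize,    // even = stable, odd = write in progress
    writer: Mutex<()>,     // serialises writers ("lock group")
    value: UnsafeCell<V>,
}

// Readers only ever copy the bytes out and validate via the epoch, so this is
// fine for Copy payloads even when V itself is not Sync.
unsafe impl<V: Copy + Send> Sync for OptimisticSlot<V> {}

impl<V: Copy> OptimisticSlot<V> {
    fn new(value: V) -> Self {
        Self {
            epoch: AtomicUsize::new(0),
            writer: Mutex::new(()),
            value: UnsafeCell::new(value),
        }
    }

    // Read: loop { read epoch, if odd retry, copy raw bytes, re-check epoch }.
    fn read(&self) -> V {
        loop {
            let before = self.epoch.load(Ordering::SeqCst);
            if before & 1 == 1 {
                continue; // a write is in progress, retry
            }
            let copy = unsafe { std::ptr::read_volatile(self.value.get()) };
            fence(Ordering::SeqCst);
            if self.epoch.load(Ordering::SeqCst) == before {
                return copy;
            }
        }
    }

    // Insert/Update: lock, bump epoch to odd, write, bump epoch back to even.
    fn write(&self, new: V) {
        let _guard = self.writer.lock().unwrap();
        self.epoch.fetch_add(1, Ordering::SeqCst); // becomes odd
        unsafe { *self.value.get() = new };
        self.epoch.fetch_add(1, Ordering::SeqCst); // becomes even
    }
}

fn main() {
    let slot = OptimisticSlot::new(1u64);
    slot.write(42);
    assert_eq!(slot.read(), 42);
}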

👋 Scala developer here.

Do you have an opinion on adaptive radix trees?

Scala uses a TrieMap. In my experience traversing elements is often much slower on tree-based maps. This might be something to consider.

Just out of curiosity, how does libcuckoo's performance compare to Java's ConcurrentHashMap? It might be an apples vs oranges comparison but interesting nonetheless.

Here is a quick introduction to the design principles of the ConcurrentHashMap in Java. A few things are obviously bound to the memory model of the JVM.


Feel free to move or delete my post if it's not adding any value 🙂

I was considering the cost of a reader being the last owner of an Arc<T>, and having to bear the cost of destroying T and freeing the underlying memory, an operation of potentially unbounded complexity.

It turns out that as long as the concurrent map is generic over the key and value types, and does not enforce Arc<(K, V)> for example, then this can be solved in an orthogonal fashion. crossbeam could simply offer a type which embeds a sink, and upon destruction sends its content into the sink, to be disposed of by whoever polls the source (such as a background thread, or a producer thread recycling allocated objects).

The exact details of such a type do not have to be worked out here; however I believe it makes a strong case for leaving the user in charge of deciding exactly which type should be stored in the map, rather than arbitrarily using Arc or some other (allocating) wrapper.

Thanks everyone for your comments! I still haven't gotten the time to really dig into all of them, but they are appreciated <3

Just a quick update: @Amanieu is prototyping a concurrent hash table based on hashbrown that uses a combination of locks and epochs. We discussed this during the Rust All Hands and figured it'd be good to provide a low-level concurrent hash table on top of which higher-level hash tables with different tradeoffs can then be built. For example, we could have a higher-level layer that clones Arcs on lookups, another layer that uses epochs instead, or a layer that clones values.

There is a design and a good chunk of a Rust implementation for a concurrent hash table based on nested hash tables, so that reallocation would not block all threads: http://ticki.github.io/blog/an-atomic-hash-table/

@Shnatsel: The idea is ingenious; however, following a chain of pointers can lead to poor cache behavior, and repeated hashing hurts performance (or did I not understand the seeded hashing?).

If looking up in multiple sub-tables is an option, I'd favor simply having N tables, where the table at index i+1 contains 2x/4x/8x as many slots as the table at index i. It's dead simple, has fewer indirections, pointers to sub-tables are never updated once set, and the hash can be computed once and for all.

The real challenge, though, is handling deletions & dangling references without putting an undue burden on the user and incurring too much overhead. That's a tough one :/

A very simplistic Concurrent Hash Map based on Abseil's Swiss Map: https://greg7mdp.github.io/parallel-hashmap/ => simply using an array of Abseil's maps, each with a lock.

I was just reading A Practical Analysis of Rust's Concurrency Story and the brief claims:

Our code is publicly available at this https URL and is one of the fastest concurrent hashmaps for the Rust language, which leads to mitigating bottlenecks in concurrent programs.

I'm the advisor on that project, and I'd say that claim is probably unfounded, and the result of the authors being particularly excited about their results :) It does generally avoid locking, which helps it scale pretty well, but I think the design can still be heavily optimized (for example, it currently uses lock-free linked lists for buckets).

Another example that seems to be more maintained would be evmap.

evmap separates readers and writers, so it fits different use cases than a single type that implements Sync and provides mutation through &self references.

Hey! We are making progress on this I think. Jonhoo ported the Java ConcurrentHashMap over to Rust and is now competing with my https://docs.rs/dashmap/3.5.1/dashmap and some other ones. Jonhoo and I are putting together some benchmarks of all of this using a port of the libcuckoo benchmark suite.

I will make sure to post here once they are finished.

An initial benchmark suite which includes the current implementations of concurrent maps based on a port of the libcuckoo benchmark harness is available here. Though I would refrain from drawing conclusions from this benchmark right now. I have a few more specialized benchmarks to implement; DashMap is receiving a big rework soon that improves performance significantly and we are investigating Flurry's bad performance.

I ran the same benchmarks against libcuckoo and the experimental unreleased dashmap. I was surprised to see libcuckoo heavily outperformed. I still have lots of exciting optimizations to make so this is really promising.

Have you tried RwLock<HashMap> with parking_lot's RwLock? If you enable the "nightly" feature then it uses hardware lock elision on recent Intel CPUs, which allows reads to run concurrently.
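
For reference, that baseline is just a single reader-writer lock around a plain map (the lock-elision part is a cargo feature and not visible in the code); a minimal sketch:

use parking_lot::RwLock;
use std::collections::HashMap;

fn main() {
    let map: RwLock<HashMap<String, u32>> = RwLock::new(HashMap::new());
    map.write().insert("hits".to_string(), 1); // exclusive access for writes
    let value = map.read().get("hits").copied(); // shared access for reads
    assert_eq!(value, Some(1));
}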

I have indeed, performance improved a bit but the general upwards trend was still the same. The reason for this is that the workload also does a tiny amount of write updates to simulate real world scenarios.

What is the state of this issue? Is libcuckoo the best choice for this?

https://github.com/xacrimon/dashmap is a fast and production-ready concurrent hashmap implementation.