Use concurrent (aka "lockless") hash map in the AtomTable
The AtomTable uses a mutex to guard access to the TypeIndex. This mutex could be mostly avoided by using a concurrent hash map.
AtomTable Status: After exploring this for a while, there's a problem. What AtomTable/`TypeIndex` really requires is a concurrent unordered multimap with thread-safe erasure. There are 5 or 6 packages that provide concurrent hash maps, but only one provides a concurrent multimap: Intel TBB. Unfortunately, it does NOT provide a thread-safe erase, so it seems like a dead end! There's still one possibility: use a concurrent hash map, and store concurrent linked lists as the values. Yikes! Fixed in #2907.
Atom Status: The Atom (`Atom.cc`) stores Values in a `std::map`. It does NOT use a hash table, in order to keep Atoms as small as possible. Although the `std::map` could be replaced by a lockless hash table, this risks blowing up RAM usage. So: how bad would that be? How would it change the size of an Atom? Atom also uses an ordinary `std::set` for the incoming set ... same as before: we want a tree here, not a hash, to keep the Atom as small as possible.
Overall status: This issue primarily affects users with highly threaded workloads on modern multi-core (more than 8-core) machines. It appears that the mutexes are NOT the primary bottleneck; instead, it's the CPU implementation of atomic increment. See issue #2758 for discussion. Conclusion: a lock-free `find()` in a hash set could speed things up for truly demanding users, but we don't have users like that.
Lockfree comments: The notes below review the available options. We need two things: a lock-free hash set (for `TypeIndex`), and a lock-free tree (i.e. something smaller and less RAM-intensive than a hash set, for the incoming set). The current options are thin, approximating zero: most of them are hard to use or fail to meet requirements. Lock-free theory looks strong, but drop-in replacements for `std::set` and `std::unordered_set` are missing (Facebook's folly comes close). Documentation and benchmarks are lacking or inadequate. This is still the bleeding edge.
Implementing this requires:
- Copying the concurrent hash map source to cog-utils
- Modifying `TypeIndex` to use it
- Removing the mutex lock in `TypeIndex`
- Doing the same for `Atom.cc`
Possibilities:
- https://github.com/khizmax/libcds - multiple types of lock-free structures. Has Debian packages.
- https://preshing.com/20160201/new-concurrent-hash-maps-for-cpp/ - see Junction, below. Requires QSBR.
- https://github.com/mthom/managed-ctrie
- https://github.com/Taymindis/atomic_hashtable
- https://github.com/facebook/folly
- https://github.com/divfor/atomic_hash - simple, but lacking a C++-compatible API.
- Intel TBB concurrent unordered multimap: https://software.intel.com/content/www/us/en/develop/documentation/tbb-documentation/top/intel-threading-building-blocks-developer-reference/containers/concurrentunorderedmap-and-concurrentunorderedmultimap-template-classes.html
- https://github.com/AlexeyAB/object_threadsafe
Each seems to have limitations and usability issues... See below for some commentary.
Starting with the simplest one, even if it is not the fastest, is probably the best/safest idea.
This is important work: the current scheme bindings have a great deal of trouble parallelizing, and I think that this is due to lock collisions in the AtomTable.
No. See issue #2758. Almost all the parallelism problems are due to the CPU implementation of atomic increment. There could be some speedup from a lock-free `TypeIndex` (at least if the set `find()` is made lock-free), but this no longer seems as high-priority as I first imagined.
What is actually needed is a drop-in replacement for `std::unordered_multimap`, and none of the solutions above provide that. Intel TBB does seem to provide that! However, the TBB version does not provide a thread-safe `erase()`. And we need that.
Comments about the concurrent hash maps mentioned above:
- libcds -- has Debian packages. Multiple choices of containers. All of them have difficult-to-use APIs, requiring all sorts of complex compare functions (aka "traits") to be provided. None are just-plain sets. Under-documented. No examples.
  - `michael_set` is fixed-size, not dynamically expandable.
  - `skip_list_set` -- O(log N) performance.
  - `striped_set` -- lock striping.
  - `feldman_hashset` -- requires a perfect hash, and the hash is fixed-size. So no strings as keys.
  - `ellen_bintree_set` -- unbalanced binary tree, O(log N) performance for a uniformly distributed hash.
  - `cuckoo_set` -- seems to use `unique_ptr` instead of GC.
- junction -- Docs say: "To make QSBR work, each thread must periodically call `junction::DefaultQSBR.update` at a moment when that thread is quiescent." Each thread? Not sure how to do that, because the AtomSpace doesn't know about threads, and doesn't know when things are quiescent. So this seems like a problem with using junction. Also: "The maps don't currently support smart pointers." But otherwise very impressive.
- Taymindis/atomic_hashtable -- unclear documentation; unmaintained.
- folly -- there's no Debian package for it. Vast number of users; maintained. Several candidates:
  - AtomicHashMap -- lockless `find()`. Drop-in replacement for `std::unordered_map`. Erased elements are not reclaimed, but are tombstoned forever. Bad idea for the generic AtomSpace.
  - F14ValueSet -- this is NOT concurrent; locks are still needed. It's just "faster". Yes, it does appear to work. The performance improvement, if any, is hard to discern.
  - ConcurrentHashMap -- lockless `find()`. Drop-in replacement for `std::unordered_map`. Works. No discernible performance impact on the `bio-loop3.scm` benchmark.
- divfor/atomic_hash -- simple API, but lacks a C++ container API. Appears to be a fixed-size hash table; does not auto-resize. The AtomSpace needs a resizable table.
Idea: in `TypeIndex.h`, instead of declaring

```cpp
typedef std::unordered_multimap<ContentHash, Handle> AtomSet;
```

do NOT use `ContentHash`, but use the string s-exp of the Atom instead! This then becomes a plain map, instead of a multimap ... except that the string version cannot deal with alpha-equivalent expressions. Bummer.
This could be solved by having a flag on each atom: "contains alpha terms". If set to true, fall back to a lock plus content-hash; else use the lock-free lookup! That might work!
It would require a fast atom->short-string function, however. (This could be made fast by storing substrings in RAM, but this blows up RAM usage even more.)
FIXED in #2902; the multimap design was insane.
In `TypeIndex.h`, there was no reason for having an unordered multimap. I removed it in pull req #2902, replacing it with an unordered set. This now opens the door for the concurrent data structures. The `TypeIndex` set replacement needs to support:
- insertion
- deletion
- find
After the cleanup of #2907 (and the four prior pull reqs) experimenting with alternatives is now possible.
Confusion reigns: I recompiled both `Atom.h` and `TypeIndex.h` with the locks completely disabled. Then I ran the `opencog/benchmark` `bio-loop3.scm` benchmark and saw only minor (almost unnoticeable) parallelism improvements. So whatever is causing the lock contention, it's not these two.
Double-checking these results: using `folly::ConcurrentHashMap` in `TypeIndex` (and removing the locks) also has no effect on parallelism.
Some `mutrace` experimental results.
For the locked version: the lock in `TypeIndex::findAtom()` is hit 19 million times, and this accounts for almost all of the non-parallelizable time in the parallel `bio-loop3.scm` benchmark. Most of the calls are coming from `PatternMatchEngine.cc:1337`, this snippet of code:
```cpp
// Yuck. What we really want to do here is to find out
// if `Link(t, oset)` is in the incoming set of `hg`. But
// there isn't any direct way of doing this (at this time).
Handle hup(_pmc.get_link(hg, t, std::move(oset)));
```
and `_pmc.get_link()` just calls `as->get_link()`, which calls `TypeIndex::findAtom()`.
The unlocked versions show no lock contention. Conclude: it is not lock contention that is slowing us down.
Latest theory: The smart pointers and other assorted atomics in the code are used so frequently that they are causing cache-line contention for the locks. (CPUs implement atomics inside of cache lines.) Bingo! That's exactly it! Things parallelize nicely on modern CPUs! See issue #2758 for a general discussion.