Use concurrent (aka "lockless") hash map in the AtomTable
The AtomTable uses a mutex to guard access to the TypeIndex. This mutex could be mostly avoided by using a concurrent hash map.
AtomTable Status: After exploring this for a while, there's a problem. What AtomTable/`TypeIndex` really requires is a concurrent unordered multimap with thread-safe erasure. There are 5 or 6 packages that provide concurrent hash maps, but only one provides a concurrent multimap: Intel TBB. Unfortunately, it does NOT provide a thread-safe erase, so it seems like a dead end! There's still one possibility: use a concurrent hash map, and store concurrent linked lists as the values. Yikes! Fixed in #2907.
Atom Status: The Atom (`Atom.cc`) stores Values in a `std::map`. It does NOT use a hash table, in order to keep Atoms as small as possible. Although the `std::map` could be replaced by a lockless hash table, this risks blowing up RAM usage. So: how bad would that be? How would it change the size of an Atom? Atom also uses an ordinary `std::set` for the incoming set ... same as before: we want a tree here, not a hash, to keep the Atom as small as possible.
Overall status: This issue primarily affects users with highly threaded workloads on modern multi-core (more than 8-core) machines. It appears that the mutexes are NOT the primary bottleneck; instead, it's the CPU implementation of atomic increment. See issue #2758 for discussion. Conclusion: a lock-free `find()` in a hash set could speed things up for truly demanding users, but we don't have users like that.
Lockfree comments: The notes below review the available options. We need two things: a lock-free hash set (for `TypeIndex`), and a lock-free tree (i.e. something smaller and less RAM-intensive than a hash set, for the incoming set). The current options are thin, approximating zero: most of them are hard to use or fail to meet requirements. Lock-free theory looks strong, but drop-in replacements for `std::set` and `std::unordered_set` are missing (Facebook's folly comes close). Documentation and benchmarks are lacking or inadequate. This is still the bleeding edge.
Implementing this requires:
- Copying the concurrent hash map source to cog-utils
- Modifying `TypeIndex` to use it
- Removing the mutex lock in `TypeIndex`
- Doing the same for `Atom.cc`
Possibilities:
- https://github.com/khizmax/libcds - multiple types of lock-free structures. Has Debian packages.
- https://preshing.com/20160201/new-concurrent-hash-maps-for-cpp/ - see Junction, below. Requires QSBR.
- https://github.com/mthom/managed-ctrie
- https://github.com/Taymindis/atomic_hashtable
- https://github.com/facebook/folly
- https://github.com/divfor/atomic_hash - simple, but lacking a C++-compatible API.
- Intel TBB concurrent unordered multimap: https://software.intel.com/content/www/us/en/develop/documentation/tbb-documentation/top/intel-threading-building-blocks-developer-reference/containers/concurrentunorderedmap-and-concurrentunorderedmultimap-template-classes.html
- https://github.com/AlexeyAB/object_threadsafe
Each seems to have limitations and usability issues... See below for some commentary.
Starting with the simplest one, even if it is not the fastest, is probably the best/safest idea.
This is important work: the current scheme bindings have a great deal of trouble parallelizing, and I think that this is due to lock collisions in the AtomTable.
No. See issue #2758. Almost all the parallelism problems are due to the CPU implementation of atomic increment. There could be some speedup from a lock-free `TypeIndex` (at least if the set `find()` is made lock-free), but this no longer seems as high-priority as I first imagined.
What is actually needed is a drop-in replacement for `std::unordered_multimap`, and none of the solutions above provide that. Intel TBB does seem to provide that! However, the TBB version does not provide a thread-safe `erase()`. And we need that.
Comments about the concurrent hash maps mentioned above:
- libcds -- has Debian packages. Multiple choices of containers. All of them have difficult-to-use APIs, requiring all sorts of complex compare functions (aka "traits") to be provided. None are just-plain sets. Under-documented. No examples.
  - `michael_set` is fixed-size, not dynamically expandable.
  - `skip_list_set` -- O(log N) performance.
  - `striped_set` -- lock striping.
  - `feldman_hashset` -- requires a perfect hash, and the hash is fixed-size. So no strings as keys.
  - `ellen_bintree_set` -- unbalanced binary tree, O(log N) performance for a uniformly distributed hash.
  - `cuckoo_set` -- seems to use `unique_ptr` instead of GC.
- junction -- Docs say: "To make QSBR work, each thread must periodically call `junction::DefaultQSBR.update` at a moment when that thread is quiescent." Each thread? Not sure how to do that, because the AtomSpace doesn't know about threads, and doesn't know when things are quiescent. So this seems like a problem with using junction. Also: "The maps don't currently support smart pointers." But otherwise very impressive.
- Taymindis/atomic_hashtable -- unclear documentation; unmaintained.
- folly -- there's no Debian package for it. Vast number of users; maintained. Several candidates:
  - AtomicHashMap -- lockless `find()`. Drop-in replacement for `std::unordered_map`. Erased elements are not reclaimed, but are tombstoned forever. Bad idea for the generic AtomSpace.
  - F14ValueSet -- this is NOT concurrent; locks are still needed. It's just "faster". Yes, it does appear to work. The performance improvement, if any, is hard to discern.
  - ConcurrentHashMap -- lockless `find()`. Drop-in replacement for `std::unordered_map`. Works. No discernible performance impact on the `bio-loop3.scm` benchmark.
- divfor/atomic_hash -- simple API, but lacks a C++ container API. Appears to be a fixed-size hash table; does not auto-resize. The AtomSpace needs a resizable table.
Idea: in `TypeIndex.h`, instead of declaring

```cpp
typedef std::unordered_multimap<ContentHash, Handle> AtomSet;
```

do NOT use `ContentHash`, but use the string s-exp of the Atom instead! This then becomes a plain map, instead of a multimap ... except that the string version cannot deal with alpha-equivalent expressions. Bummer.
This could be solved by having a flag on each atom: "contains alpha terms". If set to true, fall back to a lock plus content-hash; else use the lock-free lookup! That might work!
It would require a fast atom->short-string function, however. (This could be made fast by storing substrings in RAM, but this blows up RAM usage even more.)
FIXED in #2902; the multimap design was insane.
In `TypeIndex.h`, there was no reason for having an unordered multimap. I removed it in pull req #2902, replacing it with an unordered set. This now opens the door for the concurrent data structures. The `TypeIndex` set replacement needs to support:
- insertion
- deletion
- find
After the cleanup of #2907 (and the four prior pull reqs) experimenting with alternatives is now possible.
Confusion reigns: I recompiled both `Atom.h` and `TypeIndex.h` with the locks completely disabled. Then I ran the `opencog/benchmark` `bio-loop3.scm` benchmark and saw only minor (almost unnoticeable) parallelism improvements. So whatever is causing the lock contention, it's not these two.
Double-checking these results: using `folly::ConcurrentHashMap` in `TypeIndex` (and removing the locks) also has no effect on parallelism.
Some `mutrace` experimental results.
For the locked version: the lock in `TypeIndex::findAtom()` is hit 19 million times, and this accounts for almost all of the non-parallelizable time in the parallel `bio-loop3.scm` benchmark. Most of the calls are coming from `PatternMatchEngine.cc:1337`, this snippet of code:
```cpp
// Yuck. What we really want to do here is to find out
// if `Link(t, oset)` is in the incoming set of `hg`. But
// there isn't any direct way of doing this (at this time).
Handle hup(_pmc.get_link(hg, t, std::move(oset)));
```
and `_pmc.get_link()` just calls `as->get_link()`, which calls `TypeIndex::findAtom()`.
The unlocked versions show no lock contention. Conclude: it is not lock contention that is slowing us down.
Latest theory: The smart pointers and other assorted atomics in the code are used so frequently that they are causing cache-line contention for the locks. (CPUs implement atomics inside of cache lines.) Bingo! That's exactly it! Things parallelize nicely on modern CPUs! See issue #2758 for a general discussion.