This code aims to provide simple high-performance counters for multithreaded C++ software.
To get correct counters that can be updated from multiple threads simultaneously either requires locks or atomic operations. Atomic operations sound fast ('because no locks') but in fact are not.
Every time a counter is updated atomically, it requires the CPU to own that cacheline exclusively. If multiple CPUs are attempting to update the same counter at the same time they keep fighting over the cacheline.
When two threads do this 200 million times on my system, this takes 4 whole seconds, or 50 million operations/second. When further threads are added, performance gradually goes down to 45 million operations/second.
This means that if a process updates 10 counters per packet, whatever you do, you'll never scale beyond 5 million packets/second, no matter how many cores you throw at it - assuming the rest of your code takes no time at all! (Note that one might expect several counters to not interfere with each other, but the cache line communications appear to be limited to around 50 million/second on my system, no matter how many lines these are spread over).
Meanwhile, when two threads update unrelated 64 bit counters, they achieve over a billion updates per second. Using 8 threads on my 4-core system improves that number to 3 billion updates per second.
Per-thread counters are super fast, but not what we need - our process needs to report what it is doing from an operator's perspective, and the operator does not care about our threads.
This code assumes you'll have a relatively low number of threads ('dozens to hundreds'), and that counters will be updated furiously, but read only sporadically.
Sample code for a single counter:
UnsharedCounterParent ucp;
void worker()
{
UnsharedCounter uc(&ucp);
for(int n = 0 ; n < 1000000000; ++n)
++uc;
}
int main() {
std::thread t1(worker), t2(worker);
t1.join();
t2.join();
cout << ucp.get() << "\n"; // prints 2000000000
}
It is safe to call ucp.get()
at any time. However, an UnsharedCounter
should never be updated from more than one thread at once.
Real code frequently keeps dozens of counters, often organized in a struct. To make life easy, code is provided to make such struct unshared updatable as well:
struct MyCounters
{
uint64_t a;
uint64_t b;
};
UnsharedCounterStructParent<MyCounters> myc;
void worker()
{
auto uc = myc.getLocal();
for(unsigned int n = 0; n < 1000000000; ++n)
++uc.d_value.a;
}
int main() {
std::thread t1(worker), t2(worker);
t1.join();
t2.join();
cout << myc.get().a << "\n"; // prints 2000000000
cout << myc.get().b << "\n"; // prints 0
}
For UnsharedCounter(Struct)
instances, ensure that the UnsharedCounter(Struct)Parent
does not ever get destroyed before its children are. This requirement is easy to meet if the Parent is always in global scope, whereas its children are always in function scope.
UnsharedCounter(Struct)
instances must always be destroyed properly. If for example a thread is killed using pthread_kill
(which you should not do), the destructor may not get called. This will leave the parent pointing to bad data.
The struct behind an UnsharedCounterStruct
most consist exclusively of 64 bit counters. Any deviation from this will mess up the addition of structs, which is done on a per-64 bit basis.
The UnsharedCounter objects use non-atomic counters. These counters are however read from other threads, with no locking. As long as counters do not straddle cache lines, this is safe. Given that we use 64 bit counters, on structs that will be 64 bit aligned, this should be true. Even if it isn't true, the effect will only be momentarily.
The code currently specifies counters as 'volatile'. This is not strictly necessary, but does remove some surprises. If counters would not be volatile, compilers would be free to move a counter to a register and update it there. This means the Parent object may not 'see' updates for prolonged amounts of time.
If you know what you are doing, you can remove 'volatile' and gain another speedup. Note that benchmarking then becomes very difficult as compilers tend to "see what you are doing", and optimize your loops away.
As noted, these counters are aimed to support typical numbers of threads,
frequent updates and infrequent reads. Zooming in a little bit, the actual
limit is not so much on the number of threads, but the number of
UnsharedCounter
instances.
On creation of an UnsharedCounter
it reports to its parent, and this
involves a lock. If such an instantiation happens once for every thread, the
overhead is minimal. So in other words, aim for long-lived UnsharedCounter
instances.
Retrieving the value of a counter (or a set of counters in struct) similarly
involves taking a lock, as the list UnsharedCounter
s needs to be
traversed.