marshallpierce/metrics-thoughts

Store measurements per-thread and aggregate separately


Idea from Benedict Elliott Smith:

While HdrHistogram and LongAdder are each reasonably fast, even under contention, they still have a bunch of associated costs: cache misses being the worst, which is compounded by their attempts to mitigate the effects of concurrency (i.e. by bloating their state space with many padded values). Benchmarking the effects of this is hard, because in a microbenchmark the cache behaviour is a lie, and in a macrobenchmark the cost of the cache effects is hard to attribute. Either way, atomic instructions are an order of magnitude costlier than plain counter increments.

Long story short, the simplest metrics collection solution I can think of that is universally viable is: a pair of per-thread event logs of metrics, to which we append an id and value for each event (optionally also a timestamp, which makes extremely granular analysis viable, though persisting that to graphite is obviously off the table).
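A minimal sketch of what one of those per-thread logs might look like (the `EventLog` name and layout are purely illustrative, not from the proposal; overflow handling is elided):

```java
// Hypothetical per-thread event log: each event is an (id, value) pair
// appended to plain arrays owned by exactly one application thread.
final class EventLog {
    final long[] ids;
    final long[] values;
    int size;

    EventLog(int capacity) {
        ids = new long[capacity];
        values = new long[capacity];
    }

    // Called only by the owning thread: two plain array stores and an
    // index increment - no atomics, no contention, no allocation per event.
    void append(long metricId, long value) {
        ids[size] = metricId;
        values[size] = value;
        size++;
    }
}
```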

A background thread periodically grabs the inactive log and pushes the values into its exclusively owned aggregation structures - for counters a single variable, and for histograms a plain, regular HdrHistogram. A majority of these structures will fit into this thread's L1 cache (if not, it's surmountable by having separate logs for a certain number of metrics).

When it's done, it updates a flag in the application thread's structure indicating that it should swap its active log at the next event it encounters. When that happens, the application thread updates a flag indicating it has done so - at which point the background thread is free to aggregate that thread's inactive log again (whenever it gets round to it).
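Here is a rough sketch of that handshake under the stated assumptions; `ThreadMetrics`, `Aggregator` and the metric-id dispatch are hypothetical names and details, `EventLog` is the sketch from above, and the flags are volatile purely for visibility:

```java
import org.HdrHistogram.Histogram;

// Hypothetical per-thread structure: two EventLogs, the active one written
// by the application thread, the inactive one read by the aggregator.
final class ThreadMetrics {
    EventLog active;
    EventLog inactive;
    volatile boolean swapRequested = true; // set by the aggregator; start pending
    volatile boolean swapped;              // set by the application thread

    ThreadMetrics(int capacity) {
        active = new EventLog(capacity);
        inactive = new EventLog(capacity);
    }

    // Application thread: record an event, swapping logs first if asked to.
    void record(long metricId, long value) {
        if (swapRequested) {
            EventLog tmp = active;
            active = inactive;
            inactive = tmp;
            swapRequested = false;
            swapped = true; // tells the aggregator the inactive log is ready
        }
        active.append(metricId, value);
    }
}

// Background thread: exclusively owns the aggregation structures, so a plain
// long and a plain (non-concurrent) HdrHistogram are enough.
final class Aggregator {
    static final long REQUEST_COUNT_ID = 0;       // hypothetical counter metric id

    long requestCount;                            // a single plain variable
    final Histogram latencies = new Histogram(3); // a regular HdrHistogram

    // Called periodically for each registered ThreadMetrics.
    void drain(ThreadMetrics tm) {
        if (!tm.swapped) {
            return;                    // thread hasn't swapped yet; try again later
        }
        EventLog log = tm.inactive;    // only this thread touches the inactive log now
        for (int i = 0; i < log.size; i++) {
            if (log.ids[i] == REQUEST_COUNT_ID) {
                requestCount += log.values[i];
            } else {
                latencies.recordValue(log.values[i]);
            }
        }
        log.size = 0;
        tm.swapped = false;
        tm.swapRequested = true;       // ask for another swap at the thread's next event
    }
}
```

If the swap hasn't happened yet, the aggregator simply moves on and retries on its next pass, which is where the amortization across thousands of events comes from; the volatile flags provide visibility, but there is no compare-and-swap and no contended cache line.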

End result: Coordination costs are amortized across thousands of events, and are anyway minimal - no atomic operations or contention are possible. Cache behaviour is optimal.

Storing raw events without initial aggregation is very dangerous, and should be implemented with care:

  1. An OutOfMemoryError can occur under unexpected application behaviour, for example an infinite retry of some action that fails fast with a NullPointerException. So at the very least, the application thread should be able to aggregate the log itself once a size threshold is reached. That threshold is a problem in its own right: it should be small enough to avoid unpredictable latency when the application thread aggregates instead of doing business-related work, yet big enough to guarantee that aggregation is mostly done by the separate aggregation thread rather than the application thread.

  2. This approach can hurt the garbage collector, at least CMS. Garbage collectors do not tolerate heavy traffic of middle-aged objects: if events survive several collection cycles, the GC spends time copying them between survivor regions on each cycle, and in the worst case events get promoted to the tenured space, so major collections occur more often than they should. A cyclic buffer can be used to avoid this floating garbage, but per-thread buffers are a problem in themselves, because choosing the right buffer size is hard: too small and events are lost, too big and you risk OOM in legacy-style applications that use hundreds or thousands of threads (see the sketch after this list).

  3. If an application thread records no further events, its incomplete log never becomes visible to the aggregation thread.
    In addition, the freshness of the metrics depends on the log rotation interval, which means ad-hoc queries will not show the current state. For example, with a 30-second log rotation, an administrator who wants to see the data through a JMX console (or another admin panel) will miss the latest data for 15 seconds on average, because fresh events are still sitting in the per-thread buffers.
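To illustrate points 1 and 2, here is one possible (purely illustrative) shape for a bounded log: a fixed-size, preallocated buffer of primitives, with the application thread falling back to in-thread aggregation when the threshold is reached. `BoundedEventLog` and its fields are hypothetical names, and handing the overflow histogram back to the aggregator is elided:

```java
import org.HdrHistogram.Histogram;

// Hypothetical bounded variant of the per-thread log: primitive arrays mean
// no per-event objects for the GC to age, and a hard capacity means no OOM.
final class BoundedEventLog {
    final long[] ids;
    final long[] values;
    int size;

    // If the aggregator falls behind, the owning thread drains into this
    // thread-local histogram itself (for brevity, all overflow values go into
    // one histogram; it would still need to be merged by the aggregator later).
    final Histogram overflow = new Histogram(3);

    BoundedEventLog(int capacity) {
        ids = new long[capacity];
        values = new long[capacity];
    }

    void append(long metricId, long value) {
        if (size == ids.length) {
            drainLocally();            // threshold reached: pay the cost in-thread
        }
        ids[size] = metricId;
        values[size] = value;
        size++;
    }

    private void drainLocally() {
        for (int i = 0; i < size; i++) {
            overflow.recordValue(values[i]);
        }
        size = 0;
    }
}
```

The capacity here is exactly the awkward threshold described above: too small and the application thread keeps paying for aggregation itself, too large and hundreds of threads multiply the footprint.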

In conclusion, I want to note that for well-written monitoring structures (like HdrHistogram) the contention problem is not a real issue for real-world applications, in my opinion, because applications spend most of their CPU time doing business-related work. I have profiled my applications many times, and I have never seen monitoring-related call trees near the top. According to Amdahl's law, even if monitoring were totally optimized by reducing its latency to zero (just imagine), it would change essentially nothing in application performance if monitoring's share is less than 1%.
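To make the Amdahl's law point concrete: overall speedup is 1 / ((1 - p) + p/s), so with monitoring at p = 0.01 of total CPU time, even an infinite speedup s of the monitoring code yields at most 1 / 0.99 ≈ 1.01×, i.e. about a 1% improvement overall.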