Backend paused for 10 ms in response to put
kpy3 opened this issue · 4 comments
Hello,
we have a Riak 2.2.3 cluster and decided to extend it with a Riak 2.9.1 node using the leveled backend. During handoff we see a lot of warnings like:
Backend 0 paused for 10 ms in response to put
but there is no high I/O or network activity on the host, only high CPU usage.
So, what does that message mean? Are there configuration options to eliminate the pauses, or which configuration options/host settings/etc. should we pay attention to in this situation?
When there is a backlog of changes building up in leveled's memory, it will pause the vnode to allow it to catch up.
It is configurable via riak.conf - https://github.com/basho/riak_kv/blob/develop-2.9/priv/riak_kv.schema#L155-L163
The aim of the pause is to prevent unbounded growth in memory of key changes not yet persisted to the ledger (the actual change will already have been persisted to the journal, so it doesn't impact data safety). If you're going to reduce the pause, monitor memory usage by the beam, and potentially keep an eye on riak-admin top for memory usage within the beam.
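For illustration, the riak.conf override would look something like the snippet below. I'm quoting the option name from memory, so treat it as an assumption and verify it against the schema lines linked above.

```
## Assumed option name - verify against the linked riak_kv.schema lines.
## Time (in ms) the vnode sleeps when the leveled backend requests a pause;
## the 10 ms default matches the figure in the warning message.
backend_pause_ms = 10
```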
You may be able to avoid the pauses by increasing the size of the leveled ledger_cache (again via riak.conf):
https://github.com/martinsumner/leveled/blob/master/priv/leveled.schema#L22-L27
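As a rough sketch (the option name and default below are from memory, so check the linked schema), the change would be along these lines:

```
## Assumed option name/default - check the linked leveled.schema lines.
## Number of recent key changes the bookie holds in its ledger cache before
## trying to push them to the penciller; a larger cache means fewer pushes.
leveled.cache_size = 4000
```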
If you want to go exploring in the code to see what is happening here, the pause comes from the bookie going into a slow_offer state, which is prompted by the penciller refusing the push of a ledger cache because it has a merge backlog:
https://github.com/martinsumner/leveled/blob/master/src/leveled_bookie.erl#L2233-L2266
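As a rough illustration of that flow (the module, record and function names here are hypothetical, not the real leveled code - follow the link above for the actual implementation):

```erlang
%% Hypothetical sketch of the slow_offer flow described above - see the
%% linked leveled_bookie.erl for the real implementation.
-module(slow_offer_sketch).
-export([put_object/2]).

-record(state, {ledger_cache = [], penciller, slow_offer = false}).

put_object(Change, State = #state{ledger_cache = Cache, penciller = Penciller}) ->
    Cache0 = [Change | Cache],
    case push_to_penciller(Penciller, Cache0) of
        ok ->
            %% Push accepted: the cache is flushed and no pause is requested.
            {ok, State#state{ledger_cache = [], slow_offer = false}};
        returned ->
            %% The penciller refused the push because it still has a merge
            %% backlog.  The bookie keeps the cache, enters the slow_offer
            %% state, and replies 'pause' - which is what prompts the vnode
            %% to sleep (by default 10 ms) and log the warning above.
            {pause, State#state{ledger_cache = Cache0, slow_offer = true}}
    end.

%% Stand-in for the real penciller push: pretend it refuses whenever it is
%% already holding too many unmerged keys.
push_to_penciller(_Penciller, Cache) when length(Cache) > 1000 ->
    returned;
push_to_penciller(_Penciller, _Cache) ->
    ok.
```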
Thanks for the response.
After inspecting the node we discovered that the problem was the native compression; we switched to lz4 and it looks like the warning is gone (at least there are no warnings after node re-setup and restart).
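For reference, switching the compression in riak.conf would look roughly like this (the option name is assumed from the riak_kv leveled settings, so worth verifying against the schema):

```
## Assumed option name - verify against the riak_kv schema for leveled.
## 'native' uses Erlang's built-in zlib-based compression; 'lz4' is faster
## and lighter on CPU, at the cost of a slightly worse compression ratio.
leveled.compression_method = lz4
```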
Compression is applied to each object before it is persisted, and also to the SST files in the ledger when they're written.
Switching to lz4 makes compression quicker and uses less CPU. So the faster compression should mean that the L0 merge in particular completes more quickly, and you're therefore less likely to have an incomplete merge before the next attempt to push the ledger cache.
So it does make sense that changing the compression would reduce the likelihood of hitting the backend pause.