A WAL-is-data engine used to store multi-raft logs
Currently, we use a dedicated RocksDB instance to store raft entries in TiKV, and this approach has several problems. First, entries are appended to the WAL and then flushed into SSTs, so the same data is written twice. Second, write amplification during compaction hurts performance. Dropping applied entries also requires extra writes, and compaction requires disk reads as well.
RaftEngine aims to replace this RocksDB instance for storing raft entries. Here I will explain how RaftEngine works and what we gain from it.
First of all, all operations on RaftEngine are thread-safe.
RaftEngine contains two major components: `pipe_log` and `memtable`. `pipe_log` consists of a series of files that only support append writes, and all regions share the same `pipe_log`. `pipe_log` is where data is persisted.
Each region has an individual `memtable`. A `memtable` consists of three members: `entries`, `kvs`, and `file_index`.
- `entries`: a `VecDeque` that stores the active entries, or the positions in `pipe_log` of these active entries, for this region. Operations on a `VecDeque` are fast.
- `kvs`: a `HashMap` that stores key-value pairs for this region.
- `file_index`: another `VecDeque` that records which files contain entries for this region; it is used to GC expired files.
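To make the structure concrete, here is a minimal Rust sketch of such a per-region `memtable`; the field and type names are illustrative, not the actual TiKV code.

```rust
use std::collections::{HashMap, VecDeque};

/// Where one raft entry lives: its raft log index plus its position in
/// the shared pipe_log (names here are illustrative).
struct EntryIndex {
    index: u64,    // raft log index of the entry
    file_num: u64, // pipe_log file that holds the encoded entry
    offset: u64,   // byte offset inside that file
    len: u64,      // encoded length
}

/// Per-region in-memory index over the shared pipe_log.
struct MemTable {
    region_id: u64,
    /// Active entries (or their positions in pipe_log). Pushing and popping
    /// at both ends of a VecDeque is O(1), so append and compact are cheap.
    entries: VecDeque<EntryIndex>,
    /// Key-value pairs belonging to this region.
    kvs: HashMap<Vec<u8>, Vec<u8>>,
    /// (file_num, entry_count) pairs recording which files still contain
    /// entries for this region; used to decide when files can be GCed.
    file_index: VecDeque<(u64, u64)>,
}
```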
`LogBatch` is similar to RocksDB's `WriteBatch`: it gives writes an atomicity guarantee. It supports adding entries, putting key-value pairs, deleting keys, and a command to clean up a region.
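A rough sketch of the kind of interface a `LogBatch` could expose, directly mirroring the operations listed above; the names and signatures are assumptions, not the real API.

```rust
/// One atomic unit of writes, appended to pipe_log as a whole.
enum LogItem {
    Entries { region_id: u64, entries: Vec<Vec<u8>> }, // pre-encoded raft entries
    Put { region_id: u64, key: Vec<u8>, value: Vec<u8> },
    Delete { region_id: u64, key: Vec<u8> },
    CleanRegion { region_id: u64 },
}

#[derive(Default)]
struct LogBatch {
    items: Vec<LogItem>,
}

impl LogBatch {
    fn add_entries(&mut self, region_id: u64, entries: Vec<Vec<u8>>) {
        self.items.push(LogItem::Entries { region_id, entries });
    }
    fn put(&mut self, region_id: u64, key: Vec<u8>, value: Vec<u8>) {
        self.items.push(LogItem::Put { region_id, key, value });
    }
    fn delete(&mut self, region_id: u64, key: Vec<u8>) {
        self.items.push(LogItem::Delete { region_id, key });
    }
    fn clean_region(&mut self, region_id: u64) {
        self.items.push(LogItem::CleanRegion { region_id });
    }
}
```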
`compact_to(region_id, to_index)` only drops entries in the `memtable` and updates the `memtable`'s `file_index`. This operation happens purely in memory and is very fast.
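Building on the `MemTable` sketch above, the engine-level call looks up the region's `memtable` by `region_id` and then does something like the following (a simplified illustration, not the actual implementation):

```rust
impl MemTable {
    /// Drop entries whose raft index is below `to_index`, then shrink
    /// file_index so it only refers to files that still hold live entries.
    fn compact_to(&mut self, to_index: u64) {
        while let Some(front) = self.entries.front() {
            if front.index < to_index {
                self.entries.pop_front();
            } else {
                break;
            }
        }
        // The oldest file that still contains a live entry for this region.
        let min_file = self.entries.front().map(|e| e.file_num);
        while let Some(&(file_num, _)) = self.file_index.front() {
            match min_file {
                // This file still holds live entries; keep the rest too.
                Some(min) if file_num >= min => break,
                // No live entries left, or this file is older than the
                // oldest live entry: it no longer matters to this region.
                _ => {
                    self.file_index.pop_front();
                }
            }
        }
    }
}
```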
When the engine restarts, we read the files in `pipe_log` one by one and apply each `LogBatch` to the `memtable` of its region. After that, we get the first index from the kv engine for each region and use it to compact that region's `memtable`.
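A simplified sketch of this two-phase recovery, with the decoded files and the kv engine's first indexes passed in as assumed inputs; `apply_to_memtables` is left as a labeled stub.

```rust
use std::collections::HashMap;

/// Replay pipe_log on startup. For this sketch the caller supplies the
/// decoded files as `(file_num, batches)` in file order, plus the first
/// indexes already read from the kv engine.
fn recover(
    files: Vec<(u64, Vec<LogBatch>)>,
    first_indexes: &HashMap<u64, u64>,
    memtables: &mut HashMap<u64, MemTable>,
) {
    // Phase 1: apply every LogBatch to the memtable of its region.
    for (file_num, batches) in files {
        for batch in batches {
            apply_to_memtables(memtables, &batch, file_num);
        }
    }
    // Phase 2: entries below the kv engine's first index are already
    // applied and compacted there, so drop them from the memtables too.
    for (region_id, memtable) in memtables.iter_mut() {
        if let Some(&first_index) = first_indexes.get(region_id) {
            memtable.compact_to(first_index);
        }
    }
}

/// Dispatch each item in the batch to its region's memtable, recording
/// `file_num` so file_index stays accurate (body omitted in this sketch).
fn apply_to_memtables(_memtables: &mut HashMap<u64, MemTable>, _batch: &LogBatch, _file_num: u64) {}
```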
Currently, we have two RocksDB instances, each with its own WAL. If the kv engine's WAL loses some data after a machine crash, the applied index persisted in the kv engine may fall back, while raftdb's WAL says those entries have already been dropped, which causes a panic.
RaftEngine rarely runs into this situation because it keeps the latest several files.
In my tests, recovering from 1GB of `pipe_log` data takes 4.6 seconds, which is acceptable. After switching to `crc32c`, recovering from 1.37GB of log files takes 2.3 seconds.
To write, we add entries or key-value pairs into a `LogBatch`; the `LogBatch` is encoded into a byte slice and appended to the currently active file. The write returns the number of the file that served it, and then we apply the `LogBatch` to the associated `memtable`s.
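A rough sketch of this write path, assuming an append method on the shared `pipe_log` that returns the serving file number; the `PipeLog` type and the encoder here are placeholders, not the real on-disk format.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::Write;

/// Minimal stand-in for the shared, append-only pipe_log.
struct PipeLog {
    active_file_num: u64,
    active_file: File,
}

impl PipeLog {
    /// Append an encoded LogBatch and return the number of the file that
    /// served this write.
    fn append(&mut self, content: &[u8]) -> std::io::Result<u64> {
        self.active_file.write_all(content)?;
        self.active_file.sync_data()?; // persist before acknowledging the write
        Ok(self.active_file_num)
    }
}

/// Write path: encode, append, then update the in-memory indexes.
fn write(
    pipe_log: &mut PipeLog,
    memtables: &mut HashMap<u64, MemTable>,
    batch: LogBatch,
) -> std::io::Result<()> {
    let content = encode(&batch); // serialization format omitted in this sketch
    let file_num = pipe_log.append(&content)?;
    // Reuse apply_to_memtables from the recovery sketch: it records where
    // each region's new entries now live, i.e. in `file_num`.
    apply_to_memtables(memtables, &batch, file_num);
    Ok(())
}

/// Placeholder encoder; the real encoding is not shown here.
fn encode(_batch: &LogBatch) -> Vec<u8> {
    Vec::new()
}
```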
We use each `memtable`'s `file_index` to find the oldest file that still contains active entries for that region; files older than the minimum over all regions can be dropped.
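Continuing the sketch, the set of droppable files is determined by a minimum over the per-region oldest file numbers.

```rust
use std::collections::HashMap;

impl MemTable {
    /// The oldest pipe_log file that still holds live entries for this region.
    fn min_file_num(&self) -> Option<u64> {
        self.file_index.front().map(|&(file_num, _)| file_num)
    }
}

/// Every file whose number is strictly smaller than the returned watermark
/// contains no live entries for any region and can be deleted. `None` means
/// no region currently pins any file.
fn gc_watermark(memtables: &HashMap<u64, MemTable>) -> Option<u64> {
    memtables.values().filter_map(|m| m.min_file_num()).min()
}
```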
RaftEngine has a `total-size-limit` parameter, which defaults to 1GB. When the total size of the files reaches this value, we check which regions' entries have been left behind for too long.
If a region has had no new entries for a long time and its number of entries has not reached `gc_raftlog_threshold`, we rewrite these entries into the currently active file so that the old files can be dropped as soon as possible. Since the number of entries is small, the rewrite is cheap.
If a region's follower has fallen far behind, the region's entries cannot be compacted. In that case we hint `raftstore` to compact this region's entries, since we have already waited for the slow follower for as long as it took to fill `total-size-limit`.
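Putting the two rules together, the per-region decision when `total-size-limit` is reached might look roughly like this, reusing `min_file_num` from the sketch above; the threshold value and the `GcAction` names are assumptions for illustration only.

```rust
/// Assumed threshold, for illustration only; the real default may differ.
const GC_RAFTLOG_THRESHOLD: usize = 50;

enum GcAction {
    /// Few stale entries: rewrite them into the currently active file so
    /// the old files can be reclaimed right away.
    Rewrite,
    /// Many entries pinned (e.g. by a slow follower): hint raftstore to
    /// compact this region's raft log instead.
    HintRaftstoreCompact,
    /// This region does not pin any of the files we want to free.
    Nothing,
}

/// Called when total-size-limit is reached: `free_up_to` is the newest file
/// number we would like to delete (everything at or below it should go).
fn decide(memtable: &MemTable, free_up_to: u64) -> GcAction {
    let pins_old_file = memtable
        .min_file_num()
        .map_or(false, |f| f <= free_up_to);
    if !pins_old_file {
        return GcAction::Nothing;
    }
    if memtable.entries.len() < GC_RAFTLOG_THRESHOLD {
        GcAction::Rewrite
    } else {
        GcAction::HintRaftstoreCompact
    }
}
```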
RaftEngine reduces raft-associated IO by more than 60%, reduces disk load and disk IO utilization by at least 50%, and increases data importing speed by around 10%-15%.