Rust version of RocksDB
RocksDB is a storage engine used by many kinds of databases. One of its most important applications is MyRocks, a storage engine that replaces InnoDB in MySQL. Most users in the RocksDB community, however, do not need a transactional engine for MySQL; we just want a simple, well-performing KV engine. RocksDB has merged many features that we may never enable, and they make the project hard to maintain. I want to build a simple engine that is easy to maintain for simple KV applications.
RocksDB does not support asynchronous IO. Beyond IO, other methods such as `Ingest` and `CreateColumnFamily` are also synchronous, which means that every method may block the user thread for a long time. In a cloud environment this problem may be worse, because the latency of a cloud disk is much higher than that of a local NVMe SSD.
Our engine has five main modules: `WAL`, `MANIFEST`, `Version`, `Compaction`, and `Table`.
- The `WAL` module assigns a sequence number to every write and then writes it into a write-ahead-log file. It runs as an independent future task, and some other jobs, such as ingest, may also be processed in this module. You can think of it as a combination of `write_thread` and `WriteToWAL` in RocksDB. The file format is compatible with RocksDB, so this engine can be started on an existing RocksDB directory.
- `MANIFEST` persists changes to the SST files, including the results of compaction and flush jobs.
- The most important structures of the `Version` module are `VersionSet` and `KernelNumberContext`. I split them from the `VersionSet` of RocksDB. If an operation can be expressed as an atomic operation, I store its state in `KernelNumberContext`; otherwise it must acquire a lock guard on `Arc<Mutex<VersionSet>>`. `VersionSet` manages the info of each `ColumnFamily`, and every `ColumnFamily` owns a `SuperVersion`, which includes the collection of `Memtable` and the collection of `SSTable`. `SuperVersion` consists of `MemtableList` and `Version`. Every time we switch the memtable for one `ColumnFamily`, we create a new `SuperVersion` with the new `Memtable` and the old `Version`. Every time we finish a compaction job or a flush job, we create a new `SuperVersion` with the old `Memtable` and the new `Version`.
- The `Compaction` module contains all the code for `Compaction` and `Flush`.
- The `Table` module contains the SSTable format and the read/write operations built on top of it.
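
The split between atomic counters and the locked `VersionSet`, and the two ways a new `SuperVersion` is built, can be sketched roughly as follows. This is only an illustration using the standard library, not the engine's actual code: the type names come from the description above, but every field, every method name (`alloc_sequence`, `new_file_number`, `with_new_memtable`, `with_new_version`, and so on), the `SharedVersionSet` alias, and the placeholder bodies of `Memtable` and `Version` are assumptions. `MemtableList` is stood in for by a plain `Vec`.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Counters that can be updated with a single atomic operation, so no lock is needed.
/// (Field and method names are hypothetical.)
#[derive(Default)]
pub struct KernelNumberContext {
    last_sequence: AtomicU64,
    next_file_number: AtomicU64,
}

impl KernelNumberContext {
    /// Reserve `count` consecutive sequence numbers and return the first one.
    pub fn alloc_sequence(&self, count: u64) -> u64 {
        self.last_sequence.fetch_add(count, Ordering::SeqCst) + 1
    }

    /// Hand out a fresh file number.
    pub fn new_file_number(&self) -> u64 {
        self.next_file_number.fetch_add(1, Ordering::SeqCst) + 1
    }
}

/// Placeholder for an in-memory write buffer.
pub struct Memtable {
    pub id: u64,
}

/// Placeholder for the set of SST files of one column family.
pub struct Version {
    pub files: Vec<u64>,
}

/// Immutable snapshot: a memtable list plus a version.
pub struct SuperVersion {
    pub memtables: Vec<Arc<Memtable>>,
    pub version: Arc<Version>,
}

impl SuperVersion {
    /// Memtable switch: new memtable, old version.
    pub fn with_new_memtable(&self, mem: Arc<Memtable>) -> SuperVersion {
        let mut memtables = self.memtables.clone();
        memtables.push(mem);
        SuperVersion { memtables, version: self.version.clone() }
    }

    /// Flush or compaction finished: old memtables, new version.
    /// (Dropping flushed memtables is left out of this sketch.)
    pub fn with_new_version(&self, version: Arc<Version>) -> SuperVersion {
        SuperVersion { memtables: self.memtables.clone(), version }
    }
}

pub struct ColumnFamily {
    pub name: String,
    pub super_version: Arc<SuperVersion>,
}

/// State that cannot be changed atomically and therefore lives behind a mutex.
#[derive(Default)]
pub struct VersionSet {
    pub column_families: HashMap<u32, ColumnFamily>,
}

/// The shared handle mentioned in the text.
pub type SharedVersionSet = Arc<Mutex<VersionSet>>;

impl VersionSet {
    pub fn create_column_family(&mut self, id: u32, name: &str) {
        let sv = SuperVersion {
            memtables: vec![Arc::new(Memtable { id: 0 })],
            version: Arc::new(Version { files: Vec::new() }),
        };
        let cf = ColumnFamily { name: name.to_string(), super_version: Arc::new(sv) };
        self.column_families.insert(id, cf);
    }

    /// Readers clone the `Arc` and can keep using the snapshot after the lock is released.
    pub fn super_version(&self, id: u32) -> Option<Arc<SuperVersion>> {
        self.column_families.get(&id).map(|cf| cf.super_version.clone())
    }

    pub fn switch_memtable(&mut self, id: u32, mem: Arc<Memtable>) {
        if let Some(cf) = self.column_families.get_mut(&id) {
            cf.super_version = Arc::new(cf.super_version.with_new_memtable(mem));
        }
    }

    pub fn install_version(&mut self, id: u32, version: Arc<Version>) {
        if let Some(cf) = self.column_families.get_mut(&id) {
            cf.super_version = Arc::new(cf.super_version.with_new_version(version));
        }
    }
}
```

Because each `SuperVersion` is replaced rather than mutated, a reader that already holds an `Arc<SuperVersion>` keeps a consistent view of memtables and SST files while writers install newer snapshots under the lock.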
- Refactor the compaction picking strategy and calculate the effect of deleted keys.
- Support the LZ4 and ZSTD compression algorithms.
- Support hash-index for small data blocks.
- Support block-cache.
- Support AIO for asynchronous IO. (I currently use user threads as independent IO threads, but I'm not sure whether that is a better solution than AIO.)
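
For the last item, the "user thread as an independent IO thread" approach might look something like the sketch below. It is not kernel AIO and not the engine's real implementation: it is a plain `std::thread` fed through an `std::sync::mpsc` channel, and the names `IoThread`, `AppendRequest`, `spawn`, and `append` are all hypothetical. The caller queues an append and gets back a receiver, so it is never blocked by the disk write itself.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::PathBuf;
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// One append request plus the channel on which its result is reported back.
struct AppendRequest {
    data: Vec<u8>,
    done: Sender<io::Result<usize>>,
}

/// A dedicated thread that owns the file and performs all blocking writes.
pub struct IoThread {
    tx: Option<Sender<AppendRequest>>,
    handle: Option<JoinHandle<()>>,
}

impl IoThread {
    pub fn spawn(path: PathBuf) -> io::Result<IoThread> {
        let mut file = OpenOptions::new().create(true).append(true).open(path)?;
        let (tx, rx) = mpsc::channel::<AppendRequest>();
        let handle = thread::spawn(move || {
            // The loop ends once every sender has been dropped.
            for req in rx {
                let res = file
                    .write_all(&req.data)
                    .and_then(|_| file.sync_data())
                    .map(|_| req.data.len());
                // The caller may have stopped waiting; ignore a closed channel.
                let _ = req.done.send(res);
            }
        });
        Ok(IoThread { tx: Some(tx), handle: Some(handle) })
    }

    /// Queue an append and return immediately.
    /// The receiver yields the number of bytes written, or the IO error.
    pub fn append(&self, data: Vec<u8>) -> Receiver<io::Result<usize>> {
        let (done, result) = mpsc::channel();
        if let Some(tx) = &self.tx {
            // If the IO thread has exited, the request is dropped and
            // `result.recv()` reports a disconnected channel.
            let _ = tx.send(AppendRequest { data, done });
        }
        result
    }
}

impl Drop for IoThread {
    fn drop(&mut self) {
        // Close the queue first so the worker loop can finish, then wait for it.
        drop(self.tx.take());
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}
```

Requests are handled in the order they are queued, which keeps appends ordered; the trade-off compared with real AIO is that each IO thread handles one blocking call at a time.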