This project provides the initial framework to write a fully functional clone of LMDB in Rust and the ability to extend the original design while ensuring memory safety and retaining the high performance.
LMDB and its variant MDBX are used by the most performant blockchain nodes. Other databases such as RocksDB use multiple files and suffer signifcant performance penalties. The database is by far the most limiting factor in a blockchain node design.
The resulting crate should be no-std capable for RTOS applications and depend only on libc under feature control.
For example, we could use tokio-uring to provide genuinely async access to LMDB from SSD without thread stalls as an alternative to memory mapping. This is something that no existing KV store will do and results in loss of CPU time when processing data.
We can also document the crate to a higher degree then the original LMDB documentation and provide support for variants such as MDBX.
For testing, we include lmdb-rs, a LMDB Rust wrapper as a submodule which in turn includes the original LMDB source code.
The LMDB format is not well documented, so we aim to improve on that in this document.
If you want to get involved in this project, contact me, Amy Thomason, on Linkedin or otherwise or join the Oxford Rust meetup.
Note: You must be able to write Rust code without adding dependencies.
A LMDB database usually consists of two files, data.mdb and lock.mdb in a directory.
The lock file controls access and allows multiple processes to read and write the database by sequentialising the transactions.
data.mdb consists of an array of typically 4k pages. Each page typically has the following
header - note that this is highly platform dependent.
graph LR
A[Page] -->|Bytes 0..8| B[page_number<br>May be 4 bytes on 32-bit machines]
A -->|Bytes 8..10| C[pad<br>Used by some page variants for extra data]
A -->|Bytes 10..12| D[flags<br>Identifies the page type and housekeeping]
A -->|Bytes 12..14| E[lower<br>Offset to the start of empty space in the page]
A -->|Bytes 14..16| F[upper<br>Offset to the end of empty space in the page]
A -->|Bytes 16..lower| G[index<br>16 bit offsets to nodes]
A -->|Bytes lower..upper| H[Free space]
A -->|Bytes upper..4096| I[Nodes]
There are three main page types although others exist to handle large keys and values.
| Type | Usage |
|---|---|
| Meta | The first two pages in the file are meta pages and point to the root of the binary tree. |
| Leaf | Leaf pages contain sorted key-value pairs |
| Branch | Branch pages contain key-page pairs where the page is a leaf or branch page. |
graph LR
A[Meta] -->|Offset 0..4| B[Magic number<br>0xBEEFC0DE]
A -->|Offset 4..8| C[Version<br>1]
A -->|Offset 8..16| D[address<br>Used if mapping to a specific address]
A -->|Offset 16..24| E[map size<br>Size of the memory map]
A -->|Offset 24..| F[free db<br>Used to store free pages?]
A -->G[main db<br>Pointer to the root page and some stats]
A -->H[last page<br>Size of the database]
A -->I[transaction id<br>Transaction id]
A leaf page has a series of nodes indexed by 16 bit values
stored at the offset 16..lower. The first node is at upper
The Node consists of the data length (up to 32 bits) and the key length followed by the key and data.
To find a key in the page, binary chop the keys.
Leaf nodes can also contain sub databases, this allows us to have named databases (tables) inside the main database.
A branch node has the same offsets at 16..lower.
After upper The nodes are stored as the branch page number
and the first key in the sub page.
Transactions manage locked access to the database file so than no process is writing while another is reading.
Every transaction is indexed starting with 0. Even transactions update Meta page 0, odd transactions update Meta page 1
Transactions maintain a list of dirty pages which are flushed on commit.
All reads and updates to the DB work via cursors. A cursor represents a location in the database and multiple cursors may be active at one time.
The LMDB database allows multiple entries to have the same key in which case entries are also sorted by value.
This is used by blockchains to create composite keys such as block/transaction.