Rewriting SQLite in Rust for Learning and for Fun.
This was inspired by:
- The CodeCrafters SQLite programming challenge; please pay them a visit.
- SQLite detailed documentation.
- Apache Arrow DataFusion.
- Let's Build a Simple Database
docs # detailed documentation: implementation and design records, step-by-step guides, module walkthroughs, etc.
readings # related, more comprehensive readings such as books and articles
src # source code with unit tests; for more detailed module descriptions, see the Architecture section
  bin # binary CLI entry point with the main function
  access # access layer
  concurrency # concurrency control, e.g. transactions
  logical # logical-layer items such as the logical plan; not much here because arrow-datafusion handles this
  model # main domain model of a SQLite database, e.g. TableLeafCell, mapped to sqlite3 documentation concepts
  physical # physical planning and execution
  wal # Write Ahead Logging for atomicity, recovery, etc.
  storage # physical storage to a file on disk
  util
    presentation.rs # how returned rows are presented on CLI stdout, as sqlite does
    varint.rs # varint encoding and decoding
tests # integration tests and test resources
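
The varint.rs module listed above deals with SQLite's variable-length integers: 1 to 9 bytes, big-endian, where the high bit of each of the first eight bytes signals that more bytes follow and a ninth byte, if present, contributes all 8 of its bits. The following is a minimal sketch of that format; the function names encode_varint and decode_varint are assumptions for illustration and may not match this project's actual API.

```rust
/// Encode `value` as a SQLite varint (1 to 9 bytes, big-endian).
/// Hypothetical helper name; the project's varint.rs may differ.
pub fn encode_varint(value: u64) -> Vec<u8> {
    // Values that need more than 56 bits use the 9-byte form:
    // eight bytes carrying 7 bits each, then one byte carrying 8 bits.
    if value >> 56 != 0 {
        let mut out = vec![0u8; 9];
        out[8] = value as u8;
        let mut rest = value >> 8;
        for slot in out[..8].iter_mut().rev() {
            *slot = (rest & 0x7f) as u8 | 0x80;
            rest >>= 7;
        }
        return out;
    }
    // Otherwise collect 7-bit groups, then order them most significant first.
    let mut groups = Vec::new();
    let mut rest = value;
    loop {
        groups.push((rest & 0x7f) as u8);
        rest >>= 7;
        if rest == 0 {
            break;
        }
    }
    groups.reverse();
    // Every byte except the last has its high (continuation) bit set.
    let last = groups.len() - 1;
    for byte in &mut groups[..last] {
        *byte |= 0x80;
    }
    groups
}

/// Decode a varint from the start of `bytes`.
/// Returns the value and the number of bytes consumed,
/// or None if the input ends before the varint is complete.
pub fn decode_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &b) in bytes.iter().enumerate().take(9) {
        if i == 8 {
            // The ninth byte contributes all 8 bits.
            value = (value << 8) | u64::from(b);
            return Some((value, 9));
        }
        value = (value << 7) | u64::from(b & 0x7f);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}
```

In this sketch, decode_varint returns the number of bytes consumed so that a caller reading a record header can advance its cursor to the next field.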
# execute program against a sqlite database
# this table has 4 rows
cargo run -- sql tests/resources/sample.db "select name from apples"
# this table has 6895 rows and spans more than one database page
cargo run -- sql tests/resources/superheroes.db "select * from superheroes"
# suppress warnings
RUSTFLAGS=-Awarnings cargo run -- sql sample.db "select name from apples"
# see what sqlite3 returns
sqlite3 sample.db "select * from apples"
# output logs at debug level and above
export RUST_LOG=debug
cargo run -- sql sample.db "select * from apples"
RUST_LOG=debug cargo run -- sql sample.db "select * from apples" # this also works
Testing
# run tests
cargo test
# show stdout and stderr output from tests
cargo test -- --nocapture
# run all tests whose names contain move_to_right and show print output
cargo test move_to_right -- --nocapture
Sample.db schema
- apples: id integer primary key, name text, color text
- oranges: id integer primary key, name text, description text
Differences from the official SQLite implementation:
- Database Frontend (Tokenizer, Parser) is replaced by DataFusion.
- Virtual Machine is replaced by using DataFusion and a custom Physical Layer for query processing and execution.
Layers
SQL String
-----Logical Layer-----
Query Planning: Tokenizer, Parser (datafusion)
Logical Plan (datafusion)
-----Physical Layer-----
Physical Planner (custom)
Physical Plan (custom)
-----Access Layer-----
BTree Module (custom)
Buffer Pool (custom)
Concurrency Control (custom)
WAL Write Ahead Logging (custom)
-----Storage Layer-----
Disk Manager
-----Physical File------
File (e.g. on disk) following SQLite database file format
Logical Layer
- This layer is responsible for interpreting the SQL string and converting it into a logical plan that represents the operations to be performed on the data.
- It involves these steps:
  - Tokenizer: the SQL string is tokenized into tokens.
  - Parser: the tokens are used to build an Abstract Syntax Tree (AST).
  - Query Planning: the AST is transformed into a Logical Plan.
- DataFusion library is used for this layer.
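
To make the tokenizer step concrete, here is a toy sketch of what "tokenizing" a SQL string means. It is not DataFusion's tokenizer and is far simpler than a real one (no string literals, comments, or multi-character operators); the Token enum and tokenize function are hypothetical names used only for illustration.

```rust
/// A deliberately tiny token model (assumption: real tokenizers
/// distinguish keywords, string literals, operators, and more).
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Word(String),   // keywords and identifiers, e.g. select, apples
    Number(String), // unsigned integer literals
    Symbol(char),   // any other single character, e.g. * , = ;
}

/// Split a SQL string into tokens, skipping whitespace.
pub fn tokenize(sql: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = sql.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if c.is_ascii_alphabetic() || c == '_' {
            // Consume a run of identifier characters.
            let mut word = String::new();
            while let Some(&d) = chars.peek() {
                if d.is_ascii_alphanumeric() || d == '_' {
                    word.push(d);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(Token::Word(word));
        } else if c.is_ascii_digit() {
            // Consume a run of digits.
            let mut number = String::new();
            while let Some(&d) = chars.peek() {
                if d.is_ascii_digit() {
                    number.push(d);
                    chars.next();
                } else {
                    break;
                }
            }
            tokens.push(Token::Number(number));
        } else {
            tokens.push(Token::Symbol(c));
            chars.next();
        }
    }
    tokens
}
```

A parser would then consume this token stream to build the AST; in this project that work is delegated to DataFusion.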
Physical Layer
- Physical Planner: takes the LogicalPlan of arrow-datafusion and transforms it into an executable physical plan called Exec.
- Why is the physical planning of arrow-datafusion not used?
  - This layer is custom-built so that it can provide SQLite functionality. For example, the physical plan of a table scan scans the table in the database file in SQLite format.
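
As a rough illustration of what an executable physical plan can look like, the sketch below models Exec as a trait with two operators. Only the name Exec comes from the text above; the trait shape, TableScanExec, ProjectionExec, and the in-memory rows are assumptions made for this example. A real table scan would read rows from B-Tree pages in the database file rather than from a Vec.

```rust
/// A row of already-decoded column values (assumption: strings only).
pub type Row = Vec<String>;

/// Hypothetical shape of an executable physical plan node.
pub trait Exec {
    fn execute(&self) -> Vec<Row>;
}

/// Leaf operator: yields every row of a table.
/// Here the rows live in memory as a stand-in for pages on disk.
pub struct TableScanExec {
    pub rows: Vec<Row>,
}

impl Exec for TableScanExec {
    fn execute(&self) -> Vec<Row> {
        self.rows.clone()
    }
}

/// Keeps only the selected column indexes of its input, in order.
pub struct ProjectionExec {
    pub input: Box<dyn Exec>,
    pub columns: Vec<usize>,
}

impl Exec for ProjectionExec {
    fn execute(&self) -> Vec<Row> {
        self.input
            .execute()
            .into_iter()
            .map(|row| self.columns.iter().map(|&i| row[i].clone()).collect())
            .collect()
    }
}
```

Under these assumptions, a query such as "select name from apples" would become a ProjectionExec over column 1 wrapping a TableScanExec, and calling execute on the outer node runs the whole plan.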
Access Layer
- The access layer is responsible for managing how data is accessed and manipulated. This includes managing data structures like B-Trees, handling concurrency to ensure data integrity, and handling recovery and consistency in case of system failures.
- BTree Module
  - manages the B-Tree data structure used for storing and retrieving data.
- Buffer Pool
  - The Buffer Pool is an in-memory cache of data that provides faster access and supports locking, transaction management, etc.
  - When data is read from the disk, it is first loaded into the buffer pool.
  - It also handles the replacement policy when the buffer is full, typically using an LRU (Least Recently Used) policy.
- Concurrency Control
  - ensures that multiple concurrent operations (e.g. writes) can proceed without compromising data integrity (data corruption, lost updates, etc.).
- Write Ahead Logging
  - provides atomicity, recovery, etc. for the database.
  - manages the recovery process and ensures consistency after failures or crashes.
  - uses techniques such as logging and periodic checkpoints.
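
The LRU replacement policy mentioned for the Buffer Pool can be sketched as follows. This is a minimal illustration, not this project's implementation: the BufferPool type, its methods, and the loader closure standing in for the disk manager are all assumptions, and it ignores pinning, dirty pages, and locking.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical page cache with least-recently-used eviction.
pub struct BufferPool {
    capacity: usize,
    pages: HashMap<u32, Vec<u8>>,
    // Page numbers ordered by recency: front = least recently used.
    lru: VecDeque<u32>,
}

impl BufferPool {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "buffer pool needs room for at least one page");
        BufferPool {
            capacity,
            pages: HashMap::new(),
            lru: VecDeque::new(),
        }
    }

    /// Mark a page as most recently used.
    fn touch(&mut self, page_no: u32) {
        self.lru.retain(|&p| p != page_no);
        self.lru.push_back(page_no);
    }

    /// Return the cached page, calling `load` (a stand-in for a disk read)
    /// on a miss and evicting the least recently used page if full.
    pub fn get_or_load<F>(&mut self, page_no: u32, load: F) -> &[u8]
    where
        F: FnOnce(u32) -> Vec<u8>,
    {
        if !self.pages.contains_key(&page_no) {
            if self.pages.len() >= self.capacity {
                if let Some(victim) = self.lru.pop_front() {
                    self.pages.remove(&victim);
                }
            }
            let data = load(page_no);
            self.pages.insert(page_no, data);
        }
        self.touch(page_no);
        &self.pages[&page_no]
    }

    pub fn contains(&self, page_no: u32) -> bool {
        self.pages.contains_key(&page_no)
    }
}
```

The VecDeque keeps the sketch short at the cost of a linear scan in touch; a production cache would more likely use a linked structure or an approximation such as the clock algorithm.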
Storage Layer
- Disk Manager
  - a logical abstraction over the physical file system and disk access
  - provides interfaces for physical disk operations: reads, writes, flushes, etc.
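
A minimal sketch of such a disk manager is shown below, assuming fixed-size pages numbered from 1 as in the SQLite file format. The DiskManager type and its method names are hypothetical; the struct is generic over Read + Write + Seek so that it works with std::fs::File as well as an in-memory std::io::Cursor.

```rust
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Hypothetical page-oriented wrapper around a file-like object.
pub struct DiskManager<F> {
    file: F,
    page_size: usize,
}

impl<F: Read + Write + Seek> DiskManager<F> {
    pub fn new(file: F, page_size: usize) -> Self {
        DiskManager { file, page_size }
    }

    /// Position the file cursor at the first byte of `page_no` (1-based).
    fn seek_to_page(&mut self, page_no: u32) -> io::Result<()> {
        if page_no == 0 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "page numbers start at 1",
            ));
        }
        let offset = (u64::from(page_no) - 1) * self.page_size as u64;
        self.file.seek(SeekFrom::Start(offset))?;
        Ok(())
    }

    /// Read one whole page; fails if the file ends before the page does.
    pub fn read_page(&mut self, page_no: u32) -> io::Result<Vec<u8>> {
        self.seek_to_page(page_no)?;
        let mut page = vec![0u8; self.page_size];
        self.file.read_exact(&mut page)?;
        Ok(page)
    }

    /// Write one whole page and flush it to the underlying file.
    pub fn write_page(&mut self, page_no: u32, data: &[u8]) -> io::Result<()> {
        if data.len() != self.page_size {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "data must be exactly one page long",
            ));
        }
        self.seek_to_page(page_no)?;
        self.file.write_all(data)?;
        self.file.flush()
    }
}
```

For a real database file, the page size would be taken from the 2-byte big-endian integer at offset 16 of the 100-byte SQLite header (where the value 1 means 65536) instead of being passed in by hand.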
Physical File: actual file in sqlite3 file format.
Sequence Diagram: SQL String to returned result
TODO
Schema
- sqlparser-rs and datafusion seem to have no knowledge of primary key and auto-increment. Parsing DDL that contains "id integer primary key autoincrement" loses the "primary key autoincrement" information.
Data types
- Arrow supports only UTF-8 strings (Utf8). SQLite can store text as UTF-8, UTF-16BE, or UTF-16LE, so only UTF-8 databases are supported.
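
One way to enforce this restriction is to check the text encoding recorded in the database header before reading any rows. In the SQLite file format the encoding is a 4-byte big-endian integer at header offset 56 (1 = UTF-8, 2 = UTF-16LE, 3 = UTF-16BE). The sketch below assumes that layout; the TextEncoding enum and both function names are hypothetical and not taken from this project.

```rust
/// Text encodings a SQLite database file can declare.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TextEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
}

/// Read the text encoding field from the database header bytes.
pub fn text_encoding(header: &[u8]) -> Result<TextEncoding, String> {
    let raw = header
        .get(56..60)
        .ok_or_else(|| "header is shorter than 60 bytes".to_string())?;
    match u32::from_be_bytes([raw[0], raw[1], raw[2], raw[3]]) {
        1 => Ok(TextEncoding::Utf8),
        2 => Ok(TextEncoding::Utf16Le),
        3 => Ok(TextEncoding::Utf16Be),
        other => Err(format!("unknown text encoding value {}", other)),
    }
}

/// Reject databases whose text is not stored as UTF-8.
pub fn ensure_utf8(header: &[u8]) -> Result<(), String> {
    match text_encoding(header)? {
        TextEncoding::Utf8 => Ok(()),
        other => Err(format!(
            "unsupported text encoding {:?}; only UTF-8 is supported",
            other
        )),
    }
}
```

Failing early with a clear message is arguably friendlier than producing garbled strings when UTF-16 bytes are handed to Arrow as Utf8.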
- Paper - Architecture of a Database System (2007). Overview of important components to relational database systems.
- Book - SQLite Database System Design and Implementation, Sibsankar Haldar (2016).
- Article - Series: What would SQLite look like if written in Rust?