Skyvault is a high-performance, scalable key-value store backed by object storage.
Skyvault is meant to be the low-latency, high-QPS serving layer for your data. For example, an AI application can compute features in an offline system and then load them into Skyvault to serve them on the live path.
Skyvault is designed as a single-tenant system: each tenant gets its own deployment in a separate k8s namespace. Within a deployment, key-values are organized by table. A single query can write to and read from multiple tables at a given snapshot, so you can build secondary indexes by writing to the primary and index tables atomically.
Write path
Writes are batched across tables until the batch reaches a certain size or a timeout expires. Once a batch is ready, it is written to the write-ahead log (WAL).
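The size-or-timeout rule can be sketched as follows. This is a minimal illustration and not Skyvault's actual code: the `Write` and `Batcher` types, the byte-based size limit, and the idea of measuring the timeout from the first write in the batch are all assumptions made for the example.

```rust
use std::mem;
use std::time::{Duration, Instant};

/// One key-value write destined for a table (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Write {
    pub table: String,
    pub key: Vec<u8>,
    pub value: Vec<u8>,
}

/// Accumulates writes across tables and decides when a batch is ready.
pub struct Batcher {
    max_bytes: usize,
    max_wait: Duration,
    pending: Vec<Write>,
    pending_bytes: usize,
    /// Time the first write of the current batch arrived, if any.
    opened_at: Option<Instant>,
}

impl Batcher {
    pub fn new(max_bytes: usize, max_wait: Duration) -> Self {
        Batcher {
            max_bytes,
            max_wait,
            pending: Vec::new(),
            pending_bytes: 0,
            opened_at: None,
        }
    }

    /// Adds a write. Returns the batch if the size limit was reached.
    pub fn push(&mut self, write: Write, now: Instant) -> Option<Vec<Write>> {
        if self.opened_at.is_none() {
            self.opened_at = Some(now);
        }
        self.pending_bytes += write.key.len() + write.value.len();
        self.pending.push(write);
        if self.pending_bytes >= self.max_bytes {
            Some(self.take())
        } else {
            None
        }
    }

    /// Called periodically. Returns the batch if it has waited long enough.
    pub fn poll_timeout(&mut self, now: Instant) -> Option<Vec<Write>> {
        match self.opened_at {
            Some(opened) if now.duration_since(opened) >= self.max_wait => Some(self.take()),
            _ => None,
        }
    }

    /// Hands out the pending writes and resets the batcher.
    fn take(&mut self) -> Vec<Write> {
        self.opened_at = None;
        self.pending_bytes = 0;
        mem::take(&mut self.pending)
    }
}
```

Whatever `push` or `poll_timeout` returns is the batch that would then be appended to the WAL. Passing `now` in explicitly keeps the sketch deterministic and easy to test.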
Read path
On reads, we take a snapshot of the whole LSM tree and merge values across:
- All write-ahead log runs.
- All table buffer runs.
- One run at each level of the table LSM tree.
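As a rough sketch of that merge, assume the snapshot yields the runs ordered newest first (WAL runs, then table buffer runs, then LSM levels from top to bottom) and that the newest value for a key wins. Both the ordering rule and the in-memory `Run` type are assumptions for illustration; real runs are SST files in object storage, and deletes are not modeled here.

```rust
use std::collections::BTreeMap;

/// A run modeled as a sorted in-memory map (hypothetical stand-in for an SST file).
pub type Run = BTreeMap<Vec<u8>, Vec<u8>>;

/// Point lookup. `runs` must be ordered newest first; the first hit wins.
pub fn get<'a>(runs: &'a [Run], key: &[u8]) -> Option<&'a Vec<u8>> {
    runs.iter().find_map(|run| run.get(key))
}

/// Merges every run in the snapshot into a single sorted view.
pub fn merge(runs: &[Run]) -> Run {
    let mut merged = Run::new();
    // Walk oldest to newest so that newer values overwrite older ones.
    for run in runs.iter().rev() {
        for (key, value) in run {
            merged.insert(key.clone(), value.clone());
        }
    }
    merged
}
```

A production implementation would more likely stream a k-way merge over sorted iterators instead of copying into a map; the map keeps the example short.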
Background compactions
The orchestrator continuously launches k8s jobs that push data down:
- WAL -> Table buffers
- Table buffers -> Table LSM tree
- From one table LSM tree level to the next
Runs
A run represents an SST file in the object store that makes up part of our LSM tree. It is immutable:
once a run is associated with a location in the tree, it never changes.
Changelog
The changelog records the changes we make to the forest (the WAL and all the table LSM trees are collectively called the forest).
Snapshots
To keep the changelog from growing forever, we periodically snapshot the forest state and store it in the object store.
To load the current state of the forest, read the latest snapshot and then apply the changelog entries recorded since that snapshot.
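The load procedure can be illustrated with a small sketch. The types below (`Forest`, `ChangelogEntry`, `Change`, and a monotonically increasing `seq`) are hypothetical; the actual snapshot and changelog schemas are not described here, and a real forest would also track where each run sits in the tree.

```rust
use std::collections::BTreeSet;

/// Hypothetical run identifier.
pub type RunId = u64;

/// A single change to the forest (assumed shape).
#[derive(Debug, Clone, PartialEq)]
pub enum Change {
    AddRun(RunId),
    RemoveRun(RunId),
}

/// A changelog row with an assumed monotonically increasing sequence number.
#[derive(Debug, Clone, PartialEq)]
pub struct ChangelogEntry {
    pub seq: u64,
    pub change: Change,
}

/// Forest state: the set of live runs as of `last_seq`.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct Forest {
    pub last_seq: u64,
    pub runs: BTreeSet<RunId>,
}

/// Rebuilds the current forest from the latest snapshot plus newer changelog entries.
pub fn load_forest(snapshot: &Forest, changelog: &[ChangelogEntry]) -> Forest {
    let mut forest = snapshot.clone();
    // Entries at or before the snapshot are already reflected in it.
    let mut newer: Vec<&ChangelogEntry> = changelog
        .iter()
        .filter(|entry| entry.seq > snapshot.last_seq)
        .collect();
    newer.sort_by_key(|entry| entry.seq);
    for entry in newer {
        match &entry.change {
            Change::AddRun(id) => {
                forest.runs.insert(*id);
            }
            Change::RemoveRun(id) => {
                forest.runs.remove(id);
            }
        }
        forest.last_seq = entry.seq;
    }
    forest
}
```

Because only entries newer than the snapshot are replayed, older changelog entries can be discarded once a snapshot is written without affecting the result.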
Tables
Stores table configuration such as time-to-live, max LSM tree levels, and so on.
Jobs
Used to track background jobs for observability.
| Technology | Description |
|---|---|
| Production | |
| Tonic | High performance gRPC framework for Rust |
| PostgreSQL | Open source relational database |
| SQLx | Async SQL toolkit for Rust |
| Kubernetes | Container orchestration platform |
| Helm | Package manager for Kubernetes |
| Sentry | Error tracking and performance monitoring |
| Development only | |
| Minikube | Local Kubernetes implementation |
| MinIO | High performance object storage |
| Docker | Container platform |
| Cursor Editor | AI-powered code editor |
| Just | Command runner for development tasks |
- Rust (nightly)
- Protobuf compiler (protoc)
- Docker, k8s, a minikube cluster, just, and helm for local development
- PostgreSQL database instance for SQLx compile-time query checking
- Run `just build` to build skyvault and push container image to minikube.
- Run `just deploy` to start everything in k8s.
- Run `just smoke` to run some simple smoke tests against this.
- Alki - Cost-efficient petabyte-scale metadata store using LSM-tree architecture
- Procella - YouTube's analytical data warehouse unifying serving and analytical data
- Napa - Scalable data warehousing system with robust query performance
- Snowflake - Cloud data platform with separated storage and compute
- Quickwit - Cloud-native search engine built on object storage
- SingleStore - Distributed SQL database with columnstore and rowstore engines
- Rockset - Real-time analytics database with converged indexing
- InfluxDB IOx - Time series database built on Apache Arrow and DataFusion
- Firebolt - Cloud data warehouse with columnar storage
- Datadog Husky - Time series database for metrics storage
- Elastic Stateless - Stateless Elasticsearch architecture
- GreptimeDB - Cloud-native time series database
- Grafana Mimir - Scalable long-term storage for Prometheus
- Slack Astra - Structured log search and analytics engine
- Milvus - The High-Performance Vector Database Built for Scale
- OpenSearch RFC - Cloud-native OpenSearch architecture
See our Security Policy for reporting security vulnerabilities.
This project is licensed under the terms in the LICENSE file.



