
Scalable and encrypted embedded database with 3-tier caching


Infinitree


Infinitree is a versioned, embedded database that uses uniform, encrypted blobs to store data.

It works best for use cases with independent writer processes, as multiple writer processes on a single tree are not supported.

In fact, calling Infinitree a database may be generous, as all persistence-related operations are explicit. Under the hood, it uses serde for flexibility and out-of-the-box interoperability with most libraries.

Features

  • Thread-safe by default
  • Transparently handle hot/warm/cold storage tiers; currently S3-compatible backends are supported
  • Versioned data structures that can be queried using the Iterator trait without loading in full
  • Encrypt all on-disk data, and only decrypt it on use
  • Focus on performance and flexible choice of performance/memory use tradeoffs
  • Extensible for custom data types and storage strategies
  • Easy to integrate with cloud workers & KMS for access control

Example use

use infinitree::{
    Infinitree,
    Index,
    Key,
    anyhow,
    backends::Directory,
    fields::{Serialized, VersionedMap, LocalField},
};
use serde::{Serialize, Deserialize};

#[derive(Serialize, Deserialize)]
pub struct PlantHealth {
    id: usize,
    air_humidity: usize,
    soil_humidity: usize,
    temperature: f32
}

#[derive(Index, Default, Clone)]
pub struct Measurements {
    // rename the field when serializing
    #[infinitree(name = "last_time")]
    _old_last_time: Serialized<String>,

    #[infinitree(name = "last_time2")]
    last_time: Serialized<usize>,

    // only store the keys in the index, not the values
    #[infinitree(strategy = "infinitree::fields::SparseField")]
    measurements: VersionedMap<usize, PlantHealth>,

    // skip the next field when loading & serializing
    #[infinitree(skip)]
    current_time: usize,
}

fn main() -> anyhow::Result<()> {
    let mut tree = Infinitree::<Measurements>::empty(
        Directory::new("/storage")?,
        Key::from_credentials("username", "password")?
    );

    tree.index().measurements.insert(1, PlantHealth {
        id: 0,
        air_humidity: 50,
        soil_humidity: 60,
        temperature: 23.3,
    });

    *tree.index().last_time.write() = 1;
    tree.commit("first measurement! yay!");
    Ok(())
}

Versioning

Infinitree supports versioning data sets, similar to how Git versions files.

While some index fields work as snapshots (e.g. Serialized<T>) and serialize their entire content on each commit, it is possible to use e.g. VersionedMap<K, V> as an incremental HashMap.

Versioned types only store differences from the currently loaded state.

It is also possible to restore state selectively, or create completely disparate branches of data for each commit, depending on the use case.
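To illustrate the idea of storing only differences per commit, here is a minimal sketch using only the standard library. It is not Infinitree's implementation: `DiffMap`, its fields, and `state_at` are hypothetical names chosen for this example, and the real VersionedMap serializes its changes into encrypted objects rather than keeping them in memory.

```rust
use std::collections::HashMap;

/// Hypothetical incremental map (not part of Infinitree): each commit
/// records only the keys that changed since the previous commit.
#[derive(Default)]
pub struct DiffMap {
    /// Fully materialized current state.
    current: HashMap<u64, String>,
    /// Changes since the last commit; `None` marks a removal.
    pending: HashMap<u64, Option<String>>,
    /// One diff per commit, oldest first.
    commits: Vec<HashMap<u64, Option<String>>>,
}

impl DiffMap {
    pub fn insert(&mut self, key: u64, value: &str) {
        self.current.insert(key, value.to_string());
        self.pending.insert(key, Some(value.to_string()));
    }

    pub fn remove(&mut self, key: u64) {
        if self.current.remove(&key).is_some() {
            self.pending.insert(key, None);
        }
    }

    /// Seal the pending changes as a new commit and return how many
    /// keys the stored diff contains.
    pub fn commit(&mut self) -> usize {
        let diff = std::mem::take(&mut self.pending);
        let changed = diff.len();
        self.commits.push(diff);
        changed
    }

    /// Rebuild the state as of the first `n` commits by replaying diffs.
    pub fn state_at(&self, n: usize) -> HashMap<u64, String> {
        let mut state = HashMap::new();
        for diff in self.commits.iter().take(n) {
            for (key, change) in diff {
                match change {
                    Some(value) => {
                        state.insert(*key, value.clone());
                    }
                    None => {
                        state.remove(key);
                    }
                }
            }
        }
        state
    }
}
```

In this sketch, a second commit that inserts one key and removes another stores a diff of two entries regardless of how large the map has grown, and `state_at(1)` recovers the map as it was after the first commit.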

Caching

Data is always moved as part of objects.

This mechanism allows for indexing hundreds of terabytes of data that span multiple disks and cloud storage platforms, while only synchronizing and loading into memory a small proportion of that.

Application developers can exercise fine-grained control over cache layers using simple strategies, e.g. Least-Recently-Used, where recently queried objects are kept in a local directory while the rest stay in an S3 bucket.
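As a rough illustration of a Least-Recently-Used strategy across two tiers, the following standard-library sketch keeps a bounded "hot" tier in front of a "cold" tier. Everything here is an assumption made for the example: `TieredStore` and its methods are hypothetical and do not mirror Infinitree's backend API, and the cold tier is an in-memory map standing in for remote storage such as an S3 bucket.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical two-tier store (not Infinitree's backend API): a bounded
/// hot tier caches objects read from an unbounded cold tier.
pub struct TieredStore {
    capacity: usize,
    hot: HashMap<String, Vec<u8>>,
    /// Object IDs in the hot tier, least recently used first.
    recency: VecDeque<String>,
    /// Stand-in for remote storage.
    cold: HashMap<String, Vec<u8>>,
    /// Number of reads that had to go to the cold tier.
    pub cold_reads: usize,
}

impl TieredStore {
    pub fn new(capacity: usize) -> Self {
        TieredStore {
            capacity,
            hot: HashMap::new(),
            recency: VecDeque::new(),
            cold: HashMap::new(),
            cold_reads: 0,
        }
    }

    /// Writes go straight to the cold tier in this sketch.
    pub fn write(&mut self, id: &str, data: Vec<u8>) {
        self.cold.insert(id.to_string(), data);
    }

    /// Read an object, promoting it to the hot tier and evicting the
    /// least recently used entry if the hot tier is full.
    pub fn read(&mut self, id: &str) -> Option<Vec<u8>> {
        if let Some(data) = self.hot.get(id).cloned() {
            // Hot hit: mark the object as most recently used.
            self.recency.retain(|other| other.as_str() != id);
            self.recency.push_back(id.to_string());
            return Some(data);
        }
        let data = self.cold.get(id)?.clone();
        self.cold_reads += 1;
        if self.capacity > 0 {
            if self.hot.len() >= self.capacity {
                if let Some(evicted) = self.recency.pop_front() {
                    self.hot.remove(&evicted);
                }
            }
            self.hot.insert(id.to_string(), data.clone());
            self.recency.push_back(id.to_string());
        }
        Some(data)
    }

    pub fn is_hot(&self, id: &str) -> bool {
        self.hot.contains_key(id)
    }
}
```

The `VecDeque` scan makes each hot hit linear in the cache size, which keeps the sketch short; a production cache would typically use a linked structure or a dedicated LRU crate instead.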

Object system

The core of Infinitree is an object system that stores all data in uniform, encrypted 4MiB blobs. Objects are named using 256-bit random identifiers, which have no correlation to the content. Indexing data and overlaying it on the physical objects is an interesting problem.

There are 2 types of objects in the Infinitree storage model, which are indistinguishable to the storage layer.

  • Indexes are encrypted as a 4MiB unit, and support versioning of serializable data structures.
  • Storage objects store and encrypt chunks of data independently, each located by a ChunkPointer.

In both cases, knowledge of the master, symmetric encryption key is necessary to access the stored data.
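The way storage objects hold independently addressable chunks inside uniform blobs can be sketched as follows. This is only an illustration under stated assumptions: `ChunkRef` and `Packer` are hypothetical names, the fields of the real ChunkPointer are not described in this document, a plain counter stands in for the random 256-bit object ID, and encryption is left out entirely.

```rust
/// Object size used by the sketch; the text above states 4MiB.
pub const OBJECT_SIZE: usize = 4 * 1024 * 1024;

/// Hypothetical pointer to a chunk inside a storage object. The fields
/// are assumptions, not the layout of Infinitree's `ChunkPointer`.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkRef {
    /// Stand-in for the random 256-bit object ID.
    pub object: usize,
    pub offset: usize,
    pub len: usize,
}

/// Packs chunks back to back into fixed-size, zero-padded objects, so
/// every object has the same size no matter what it contains.
pub struct Packer {
    object_size: usize,
    objects: Vec<Vec<u8>>,
    /// Bytes already used in the last object.
    cursor: usize,
}

impl Packer {
    pub fn new(object_size: usize) -> Self {
        Packer { object_size, objects: Vec::new(), cursor: 0 }
    }

    /// Append a chunk, opening a new object when the current one cannot
    /// hold it. Returns `None` if the chunk exceeds the object size.
    pub fn write(&mut self, chunk: &[u8]) -> Option<ChunkRef> {
        if chunk.len() > self.object_size {
            return None;
        }
        if self.objects.is_empty() || self.cursor + chunk.len() > self.object_size {
            self.objects.push(vec![0u8; self.object_size]);
            self.cursor = 0;
        }
        let object = self.objects.len() - 1;
        let offset = self.cursor;
        self.objects[object][offset..offset + chunk.len()].copy_from_slice(chunk);
        self.cursor += chunk.len();
        Some(ChunkRef { object, offset, len: chunk.len() })
    }

    /// Look up a chunk through its pointer.
    pub fn read(&self, ptr: &ChunkRef) -> Option<&[u8]> {
        self.objects.get(ptr.object)?.get(ptr.offset..ptr.offset + ptr.len)
    }

    /// Sizes of all objects written so far; always uniform.
    pub fn object_sizes(&self) -> Vec<usize> {
        self.objects.iter().map(Vec::len).collect()
    }
}
```

Calling `Packer::new(OBJECT_SIZE)` would produce 4MiB objects; a smaller size is convenient for experimenting. Because every object is padded to the same length, an observer of the storage layer in this sketch cannot tell from size alone how many chunks an object holds.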

To establish a root of trust, a username/password combination is used to derive a passphrase using Argon2. The Argon2 output locates the so-called root object, which is the root of the versioned index tree.

Since the system requires some objects to have a deterministic identifier, all object IDs are uncorrelated with the data they store.

Data integrity is ensured using a ChaCha20-Poly1305 AEAD. The ChunkPointer stores the tags for all data encrypted in storage objects, while the tags are appended to the end of all index objects.

Note that while the master key is necessary to access the root object, there are multiple subkeys used internally, which means layering other (e.g. public key) encryption methods onto data stored in indexes is safe.

For a more in-depth overview of the security and attacker model of the object system, please see the DESIGN.md document.

Warning

This is an unreviewed piece of experimental security software.

DO NOT USE FOR CRITICAL WORKLOADS OR APPLICATIONS.

License

Released under the MIT and Apache 2 licenses.

Support

If you are interested in using Infinitree in your application, and would like to work with Symmetree Research Labs on features or implementation, get in touch.