galois: A Python repository from joaoflf

A simple document store that can store, retrieve, and query JSON documents using a custom query language.

🎯 Features • ⚙️ Installation • ⚡️ Quickstart • 🔤 Query Language • 🤖 Benchmarks • 🏈 Gameplan

🎯 Features

This project applies what I learned from the book Designing Data-Intensive Applications. It provides:

Data store supporting collections of JSON documents
CRUD operations for collections and JSON documents
Custom query language to search and filter documents based on their properties

⚙️ Installation

This project is set up as a Python package.

# install from source
git clone git@github.com:joaoflf/galois.git
cd galois 
pip install .

⚡️ Quickstart

from galois.database import Database

db = Database("music_library")
tracks = db.get_collection("tracks")

query = "(AND (NOT duration_ms=120000) (OR tempo<120 key>4))"
tracks.find(query)

🔤 Query Language

Queries are written in a simple yet powerful language designed for JSON documents. The syntax is inspired by Lisp and offers the logical operators AND, OR, and NOT, as well as the comparison operators =, <, and >.

Logical Expressions

Logical expressions are used to combine other expressions with a logical operator: AND, OR, or NOT.

The syntax is as follows:

(OPERATOR EXPRESSION1 EXPRESSION2 ...)

Example:

(AND (OR name=John name=Jane) age=30)
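To illustrate how a nested expression like this can be read, here is a minimal sketch of a recursive parser. It is not the project's actual parser (the Gameplan describes that one as regex-based). The names `tokenize` and `parse` and the nested-tuple output are assumptions made for this example, and comparisons are left as unparsed strings.

```python
import re

LOGICAL_OPERATORS = {"AND", "OR", "NOT"}


def tokenize(query):
    # Split into "(", ")", and runs of characters that are neither spaces nor parentheses.
    return re.findall(r"[()]|[^\s()]+", query)


def parse(tokens):
    """Turn a token list into nested tuples of the form (operator, child, ...)."""
    token = tokens.pop(0)
    if token != "(":
        return token  # a comparison such as "age=30", kept as a plain string here
    operator = tokens.pop(0)
    if operator not in LOGICAL_OPERATORS:
        raise ValueError(f"unknown logical operator: {operator}")
    children = []
    while tokens and tokens[0] != ")":
        children.append(parse(tokens))
    if not tokens:
        raise ValueError("missing closing parenthesis")
    tokens.pop(0)  # consume the ")"
    return (operator, *children)


tree = parse(tokenize("(AND (OR name=John name=Jane) age=30)"))
print(tree)  # ('AND', ('OR', 'name=John', 'name=Jane'), 'age=30')
```

Each opening parenthesis starts one recursive call, so the nesting of the tuples mirrors the nesting of the query.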

Comparison Expressions

Comparison expressions are used to compare a field in the document with a value.

The syntax is as follows:

field OPERATOR value

Example:

age>30
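A comparison of this shape can be split into its three parts with a single regular expression. The sketch below is illustrative only: `parse_comparison` is a hypothetical helper, and converting numeric-looking values to numbers is an assumption of this example, not documented behavior of galois.

```python
import re

# field name, then one of the three operators, then the value (no spaces, as in the examples)
COMPARISON = re.compile(r"^(?P<field>\w+)(?P<op>[=<>])(?P<value>.+)$")


def parse_comparison(text):
    """Split a comparison like 'age>30' into (field, operator, value)."""
    match = COMPARISON.match(text)
    if match is None:
        raise ValueError(f"not a comparison: {text!r}")
    raw = match.group("value")
    # Assumption: values that look numeric are compared as numbers, everything else as strings.
    try:
        value = int(raw)
    except ValueError:
        try:
            value = float(raw)
        except ValueError:
            value = raw
    return match.group("field"), match.group("op"), value


print(parse_comparison("age>30"))     # ('age', '>', 30)
print(parse_comparison("name=John"))  # ('name', '=', 'John')
```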

Semantics

AND: All of the expressions must be true.
OR: At least one of the expressions must be true.
NOT: The expression must be false.
=: The field in the document must equal the value.
<: The field in the document must be less than the value.
>: The field in the document must be greater than the value.
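These rules map directly onto a small recursive evaluator. The following is a minimal sketch, not the project's executor: the tuple-based AST shape and the `matches` function are assumptions for illustration, and treating a comparison on a missing field as false is likewise a choice made for this example.

```python
import operator

COMPARATORS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def matches(node, document):
    """Return True if the document satisfies the AST node."""
    kind = node[0]
    if kind == "AND":
        return all(matches(child, document) for child in node[1:])
    if kind == "OR":
        return any(matches(child, document) for child in node[1:])
    if kind == "NOT":
        return not matches(node[1], document)
    # Otherwise the node is a comparison: (operator, field, value).
    op, field, value = node
    if field not in document:
        return False  # assumption: a missing field never satisfies a comparison
    return COMPARATORS[op](document[field], value)


# Hand-built AST for: (AND (NOT duration_ms=120000) (OR tempo<120 key>4))
query = (
    "AND",
    ("NOT", ("=", "duration_ms", 120000)),
    ("OR", ("<", "tempo", 120), (">", "key", 4)),
)

print(matches(query, {"duration_ms": 200000, "tempo": 100, "key": 2}))  # True
print(matches(query, {"duration_ms": 120000, "tempo": 100, "key": 2}))  # False
```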

🤖 Benchmarks

We used a dataset containing nearly 600k Spotify tracks to perform various benchmarks. This dataset can be found on Kaggle.

For simplicity, only the artists, name, loudness, duration_ms, key, and tempo columns were selected from the dataset.

Document Insertion

Task: Create and write 586,672 documents to disk, including unique ID generation.

Time taken: 95,291ms

Querying

Task: Perform the following query and return matching documents:

(AND (NOT duration_ms=120000) (OR tempo<120 key>4))

Time taken: 15,547ms
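For a sense of scale, the two timings above can be converted into approximate throughput. This is plain arithmetic on the reported numbers; the query figure assumes the query examined all 586,672 inserted documents, which the benchmark description does not state explicitly.

```python
DOCUMENTS = 586_672
INSERT_MS = 95_291
QUERY_MS = 15_547

# Documents handled per second for each task.
insert_rate = DOCUMENTS / (INSERT_MS / 1000)
query_rate = DOCUMENTS / (QUERY_MS / 1000)  # assumes a full scan of the collection

# Average cost per inserted document, in milliseconds.
insert_ms_per_doc = INSERT_MS / DOCUMENTS

print(f"insertion: ~{insert_rate:,.0f} documents/s (~{insert_ms_per_doc:.2f} ms each)")
print(f"querying:  ~{query_rate:,.0f} documents/s")
```

Under that assumption, insertion runs at roughly 6,000 documents per second and the query scans roughly 38,000 documents per second.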

🏈 Gameplan

Implement CRUD (Create, Read, Update, Delete) operations for JSON documents and collections ✅
Implement a unique id generator similar to MongoDB ✅
Design a custom query language to search and filter documents based on their properties ✅
Implement a parser that turns queries into an Abstract Syntax Tree (AST) ✅
- Implemented a rudimentary regex-based parser, because the focus of the project is data management. In the future, it could be replaced by a parser built from a context-free grammar with a generator such as Bison.
Implement the AST executor ✅
Perform some load benchmarks ✅
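As an illustration of the "unique id generator similar to MongoDB" item, here is a minimal sketch modeled on the MongoDB ObjectId layout: a 4-byte timestamp, a 5-byte random value fixed per process, and a 3-byte counter that starts at a random value. It is not the generator used in galois, and `generate_id` is a hypothetical name.

```python
import itertools
import os
import threading
import time

# 5 random bytes chosen once per process, as in the ObjectId layout.
_PROCESS_RANDOM = os.urandom(5)
# 3-byte counter starting at a random value.
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))
_lock = threading.Lock()


def generate_id():
    """Return a 24-character hex id: timestamp + process random + counter."""
    timestamp = int(time.time()).to_bytes(4, "big")
    with _lock:
        count = next(_counter) % (1 << 24)  # wrap around to stay within 3 bytes
    return (timestamp + _PROCESS_RANDOM + count.to_bytes(3, "big")).hex()


first = generate_id()
second = generate_id()
print(first, second)
```

Because the timestamp comes first, ids generated later sort after earlier ones at one-second granularity, while the counter keeps ids from the same process distinct within a second.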