Easy to study Data Platform for fun and profit

Primary language: Python. License: MIT.

DataPyground


DataPyground is a data analysis framework and compute engine built for fun. It was started as a foundation for the How Data Platforms Work book, associated with the Monthly Python Data Engineering Newsletter, and was developed while writing the book to showcase the concepts explained in its chapters.

The main priority of the codebase is to be as feature complete as possible while remaining easy to understand and contribute to for people who have no prior knowledge of compute engines or data processing frameworks in general.

The codebase is heavily documented and commented to make it easy to understand and modify. Contributions are welcomed and encouraged: the project is meant to be a safe playground for learning and experimentation.

Documentation

Each component of the data platform is self-documented in a way inspired by the literate programming concept. The complete documentation is available in the project Documentation.

For a deeper understanding of the codebase and its concepts, reading the How Data Platforms Work book is recommended.

Getting Started

Install the datapyground package with pip:

pip install datapyground

Once installed, refer to the Documentation of each component to learn how to use it.

Commands

DataPyground exposes some commands to play around with its features. Currently the following commands are provided:

pyground-fquery

Runs SQL queries on CSV and Parquet files:

$ pyground-fquery -t sales=examples/data/sales.csv "SELECT Product, Quantity, Price, Quantity*Price AS Total FROM sales WHERE Product='Videogame' OR Product='Laptop' ORDER BY Total DESC LIMIT 5"
Product   | Quantity | Price | Total 
--------- | -------- | ----- | ------
Videogame | 10       | 98.31 | 983.10
Laptop    | 10       | 97.24 | 972.40
Videogame | 10       | 97.21 | 972.10
Videogame | 10       | 96.12 | 961.20
Laptop    | 10       | 92.23 | 922.30
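To make the steps of the query above concrete, here is a minimal sketch of the same filter, projection, sort, and limit in plain Python using only the standard library. It does not use DataPyground's own API or reflect its implementation; the `top_totals` helper and the sample rows are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical sample rows; these are NOT the contents of examples/data/sales.csv.
SALES_CSV = """Product,Quantity,Price
Videogame,2,10.00
Laptop,1,50.00
Phone,3,20.00
Videogame,1,30.00
"""


def top_totals(csv_text, products, limit):
    """Mimic the SELECT ... WHERE ... ORDER BY Total DESC LIMIT query by hand."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # WHERE Product='Videogame' OR Product='Laptop'
        if record["Product"] not in products:
            continue
        quantity = int(record["Quantity"])
        price = float(record["Price"])
        # SELECT Product, Quantity, Price, Quantity*Price AS Total
        rows.append(
            {
                "Product": record["Product"],
                "Quantity": quantity,
                "Price": price,
                "Total": quantity * price,
            }
        )
    # ORDER BY Total DESC
    rows.sort(key=lambda row: row["Total"], reverse=True)
    # LIMIT
    return rows[:limit]


result = top_totals(SALES_CSV, {"Videogame", "Laptop"}, limit=5)
for row in result:
    print(row["Product"], row["Quantity"], row["Price"], row["Total"])
```

A query engine typically expresses the same steps as a plan of composable operators rather than one hand-written loop, which is what lets a single command accept arbitrary SQL.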

Contributing

Contributions are welcomed and encouraged: the project is meant to be a safe playground for learning and experimentation.

The only requirement is that contributions maintain or increase the quality of the documentation and codebase. Contributions that are not properly documented won't be merged. For this project, consider the quality of the documentation more important than the elegance or performance of the code.
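As a rough illustration of what a documented contribution might look like, here is a small sketch of a function whose docstring and comments explain both what it does and why. The `running_total` function is hypothetical and is not part of DataPyground; the project's actual documentation conventions are the ones found in its codebase.

```python
def running_total(values):
    """Return the cumulative sums of ``values``.

    Each element of the result is the sum of all input elements up to and
    including that position, so the output has the same length as the input.

    >>> running_total([1, 2, 3])
    [1, 3, 6]
    """
    totals = []
    current = 0
    for value in values:
        # Carry the sum forward instead of re-summing the prefix each time,
        # which keeps the whole computation linear in the input size.
        current += value
        totals.append(current)
    return totals
```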

Contributions are currently meant to be in pure Python. This does not rule out C extensions or Cython for performance in the future, but only once the benefit they provide outweighs the complexity they add to a learning project.

Setup development environment

Install the uv Python package:

pip install uv

Then install the dependencies and the project in editable mode:

uv sync --dev

Running tests

uv run pytest -v

Building Docs

cd docs
uv run make html

Once built, the documentation can be read at docs/build/html.