Data Lake Sandbox


The main goal of this project is to provide a sandbox for learning how to create and manage data lakes. I will follow tutorials that build a data lake from scratch; they are listed in the References section.

Stacks

Stack 1 (Incomplete)

dagster + MinIO + Docker + trino

Stack 2

dagster + duckdb + S3 + terraform + localstack

A simple example that follows the article "Build a poor man’s data lake from scratch with DuckDB". It is a good entry point for learning dagster. I added a docker-compose file that runs localstack, which makes it possible to create a "persistent" S3 bucket when one is needed. I also added terraform configuration that creates the resources required for the AWS implementation.
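As a rough illustration of how duckdb can be pointed at the localstack bucket instead of real AWS, here is a minimal sketch. It is not taken from this repository: the helper names `localstack_s3_settings` and `count_rows` are hypothetical, and the endpoint `localhost:4566`, the region, and the `test` credentials are assumptions based on common localstack defaults. The `SET` options are the S3 settings of DuckDB's `httpfs` extension.

```python
"""Sketch: point DuckDB's httpfs extension at a LocalStack S3 endpoint."""
from __future__ import annotations


def localstack_s3_settings(
    endpoint: str = "localhost:4566",  # assumed LocalStack edge endpoint
    region: str = "us-east-1",  # assumed region
    access_key: str = "test",  # dummy credentials, assumed
    secret_key: str = "test",
) -> list[str]:
    """Return the DuckDB statements that redirect S3 access to LocalStack."""
    return [
        "INSTALL httpfs",
        "LOAD httpfs",
        f"SET s3_endpoint='{endpoint}'",
        f"SET s3_region='{region}'",
        f"SET s3_access_key_id='{access_key}'",
        f"SET s3_secret_access_key='{secret_key}'",
        # LocalStack is served over plain HTTP with path-style bucket URLs.
        "SET s3_use_ssl=false",
        "SET s3_url_style='path'",
    ]


def count_rows(s3_path: str) -> int:
    """Count rows of a Parquet file stored in the LocalStack bucket."""
    import duckdb  # third-party; imported lazily so the helper above has no dependencies

    con = duckdb.connect()
    for statement in localstack_s3_settings():
        con.execute(statement)
    return con.execute(
        f"SELECT count(*) FROM read_parquet('{s3_path}')"
    ).fetchone()[0]
```

With localstack running and a Parquet file uploaded, a call such as `count_rows("s3://my-bucket/data.parquet")` would query it through duckdb; the bucket and file names here are placeholders. Keeping the settings in one function makes it easy to swap them for real AWS values later.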

References