A simple project that uses PySpark to transform data.
This project uses PySpark and Delta Lake to load CSV files and transform them into a Delta table data lake.
It uses the Olist dataset. Download it and extract the CSV files into the /data/stage folder.
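As a rough sketch of what the bronze-layer ingestion looks like, the snippet below reads one raw CSV from the stage folder and persists it as a Delta table. The folder layout, table name, and function names here are illustrative assumptions, not the project's actual code (the real project drives these steps through Papermill notebooks):

```python
# Sketch of a bronze-layer ingestion step: read a raw CSV from the stage
# folder and persist it as a Delta table. Paths and names are assumptions.

def stage_path(table: str) -> str:
    """Location of a raw CSV in the stage folder (assumed layout)."""
    return f"data/stage/{table}.csv"

def bronze_path(table: str) -> str:
    """Destination of the bronze Delta table (assumed layout)."""
    return f"data/bronze/{table}"

def ingest_csv_to_delta(spark, table: str) -> None:
    """Load one CSV with header and schema inference, then overwrite
    the corresponding bronze Delta table."""
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(stage_path(table))
    )
    df.write.format("delta").mode("overwrite").save(bronze_path(table))
```

This assumes an active SparkSession configured with the Delta Lake extensions (for example, via the `delta-spark` package's `configure_spark_with_delta_pip` helper).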
Use Poetry to set up the project:
poetry lock
poetry install
The pipeline runs through a Papermill workflow. Change to the utils directory:
cd src/utils
Run the command below to execute all three layers. You can run any subset of layers by adding or removing the parameters brz, slv, and gld.
python orchestration.py brz slv gld
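To illustrate how such an orchestration script might dispatch the layer arguments, here is a minimal sketch. The notebook paths and the Papermill call are assumptions about the project's layout, not the actual contents of orchestration.py:

```python
# Sketch of a layer dispatcher: filter the command-line arguments down to
# known layer names and run one notebook per layer via Papermill.
import sys

LAYERS = ("brz", "slv", "gld")  # bronze, silver, gold

def select_layers(args):
    """Keep only recognized layer names, in medallion (brz -> slv -> gld) order."""
    requested = {a for a in args if a in LAYERS}
    return [layer for layer in LAYERS if layer in requested]

def run_layer(layer: str) -> None:
    # Imported here so the sketch stays importable without papermill installed.
    import papermill as pm
    # Hypothetical notebook locations; the real paths may differ.
    pm.execute_notebook(f"../{layer}.ipynb", f"../output/{layer}_out.ipynb")

if __name__ == "__main__":
    for layer in select_layers(sys.argv[1:]):
        run_layer(layer)
```

Running all three layers in bronze-silver-gold order matters because each layer reads the Delta tables produced by the previous one.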