Project status: ✔️ complete
Create and transform COVID-19 data from the raw layer into the trusted and refined layers.
Make sure you have Docker installed on your system, then clone the repository:

```bash
git clone https://github.com/langulski/Pyspark-Covid.git
```
Before running Docker Compose, set the variables in your `.env` file:
```
AIRFLOW_UID=1001
POSTGRES_USER='airflow'
POSTGRES_PASSWORD='airflow'
POSTGRES_DB='airflow'
_AIRFLOW_WWW_USER_USERNAME='airflow'
_AIRFLOW_WWW_USER_PASSWORD='airflow'
_AIRFLOW_WWW_USER_EMAIL=email
```
Now you can deploy Airflow by running:

```bash
docker-compose up
```
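Once the containers are healthy, the official Airflow docker-compose setup typically serves the web UI at http://localhost:8080; you can log in with the `_AIRFLOW_WWW_USER` credentials from your `.env` file.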
Project steps:
- Extract data from the raw layer;
- Create a PySpark script to transform the data into the trusted and refined layers (a sketch follows this list);
- Create the Airflow tasks;
- Check the results.
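A minimal sketch of what the transformation script might look like, assuming CSV input under `data/raw/` and Parquet output under `data/trusted/` and `data/refined/`; the paths and the `date`, `country`, and `cases` columns are illustrative, not taken from the repository:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("covid19-transform").getOrCreate()

# Read the raw layer; a production job would declare an explicit schema
# instead of inferring one.
raw_df = spark.read.csv("data/raw/covid19.csv", header=True, inferSchema=True)

# Trusted layer: deduplicated rows with properly typed columns.
trusted_df = raw_df.dropDuplicates().withColumn("date", F.to_date("date"))
trusted_df.write.mode("overwrite").parquet("data/trusted/covid19")

# Refined layer: an aggregate ready for consumption.
refined_df = trusted_df.groupBy("country").agg(F.sum("cases").alias("total_cases"))
refined_df.write.mode("overwrite").parquet("data/refined/covid19")
```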
How the pipeline works (a DAG sketch follows the list):
- Check whether the source file exists inside the raw folder;
- Run a task group that extracts the data and transforms it into the processed layer;
- Transform the data into the trusted layer;
- Transform the data into the refined layer.
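A minimal Airflow DAG sketch of this flow. The DAG id, the file path, and the Python callables are illustrative placeholders, not the repository's actual task definitions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor
from airflow.utils.task_group import TaskGroup


# Hypothetical callables standing in for the real scripts.
def extract(): ...
def to_processed(): ...
def to_trusted(): ...
def to_refined(): ...


with DAG(
    dag_id="covid19_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # 1. Wait for the source file to land in the raw folder.
    wait_for_file = FileSensor(task_id="wait_for_raw_file",
                               filepath="data/raw/covid19.csv")

    # 2. Task group: extract and transform to the processed layer.
    with TaskGroup(group_id="extract_and_process") as extract_and_process:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_process = PythonOperator(task_id="to_processed",
                                   python_callable=to_processed)
        t_extract >> t_process

    # 3-4. Promote the data through the trusted and refined layers.
    t_trusted = PythonOperator(task_id="to_trusted", python_callable=to_trusted)
    t_refined = PythonOperator(task_id="to_refined", python_callable=to_refined)

    wait_for_file >> extract_and_process >> t_trusted >> t_refined
```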
Transforming data columns into rows was somewhat challenging: since PySpark does not have a native 'melt dataframe' function, it was necessary to dig through Stack Overflow for solutions.
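One common Stack Overflow pattern for this is Spark SQL's built-in `stack()` generator, sketched below with an illustrative wide dataframe (one column per date; the names are not the repository's actual ones):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("melt-example").getOrCreate()

# Toy wide dataframe: one column per date (illustrative shape and names).
df = spark.createDataFrame(
    [("BR", 10, 15), ("US", 20, 30)],
    ["country", "2020-01-01", "2020-01-02"],
)

value_cols = [c for c in df.columns if c != "country"]

# stack(n, label1, col1, ..., labelN, colN) emits one (label, value) row
# per listed column, which is exactly a melt.
stack_expr = "stack({n}, {pairs}) as (date, cases)".format(
    n=len(value_cols),
    pairs=", ".join(f"'{c}', `{c}`" for c in value_cols),
)
melted = df.select("country", F.expr(stack_expr))
melted.show()
```

Newer Spark releases (3.4+) ship a native `DataFrame.melt` / `unpivot`, so on recent versions this workaround is no longer needed.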
Author: Lucas Angulski