Data Engineering on GCP

Technologies used in this project

  • Python
  • Terraform
  • Docker
  • Apache PySpark
  • Apache Airflow
  • GCP

Dataset

This project uses the Titanic dataset as an example.

Implementation

This project implements the following examples.

  1. A function triggered by a Pub/Sub event that stores the data in the raw data lake zone (see the sketch after this list).
  2. An Airflow pipeline with a Dataproc cluster that processes the file and saves it in the data lake processing zone. This step creates two dataset files (passengers.parquet, tickets.parquet); a PySpark sketch follows this list.
  3. A function triggered by a Cloud Storage bucket event that stores the data in BigQuery.
  4. An Airflow pipeline that loads the data from the data lake processing zone into BigQuery.

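A minimal sketch of what the Pub/Sub-triggered function in step 1 might look like. The bucket name (raw-zone-bucket), the object naming scheme, and the assumption that the message payload is CSV text are illustrative, not taken from this repository.

import base64
from datetime import datetime

from google.cloud import storage


def ingest_to_raw_zone(event, context):
    """Background Cloud Function triggered by a Pub/Sub message.

    Decodes the message payload and writes it unchanged to the raw
    data lake zone. Bucket name and object layout are assumptions.
    """
    payload = base64.b64decode(event["data"]).decode("utf-8")

    client = storage.Client()
    bucket = client.bucket("raw-zone-bucket")  # hypothetical bucket name
    blob_name = f"titanic/raw/{datetime.utcnow():%Y%m%d%H%M%S}.csv"
    bucket.blob(blob_name).upload_from_string(payload)
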
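A sketch of the PySpark job that the Dataproc cluster in step 2 could run to split the raw Titanic CSV into the two parquet datasets. The GCS paths and the column split between passengers and tickets are assumptions chosen for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("titanic-split").getOrCreate()

# Read the raw CSV from the raw data lake zone (path is an assumption).
df = spark.read.csv("gs://raw-zone-bucket/titanic/raw/", header=True, inferSchema=True)

# Passenger attributes (column choice is illustrative).
passengers = df.select("PassengerId", "Name", "Sex", "Age", "Survived")
passengers.write.mode("overwrite").parquet("gs://processing-zone-bucket/passengers.parquet")

# Ticket attributes (column choice is illustrative).
tickets = df.select("PassengerId", "Ticket", "Fare", "Pclass", "Cabin", "Embarked")
tickets.write.mode("overwrite").parquet("gs://processing-zone-bucket/tickets.parquet")
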
Create Service Account and Roles

  1. IAM & Admin -> Service Accounts -> ADD
  2. Role
    • BigQuery Admin
    • Cloud Functions Admin
    • Cloud Functions Service Agent
    • Dataproc Administrator
    • Service Account User
    • Pub/Sub Admin
    • Editor

After setting up the roles, save the JSON key in the config folder and rename the key file to 'service-account.json'.
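
For local testing, the saved key can be passed directly to the Google client libraries; a minimal sketch, assuming the key was saved as described above:

from google.cloud import bigquery

# Build a client directly from the key saved in the config folder.
client = bigquery.Client.from_service_account_json("config/service-account.json")
print(client.project)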

How to execute - Docker project

docker build -t westerley/docker-airflow:2.0.1 .
docker-compose -f docker-compose-CeleryExecutor.yml up -d --build

Connection

  • Conn Id: the name of the connection; use this ID in your DAG definition files
  • Conn Type: select Google Cloud
  • Keyfile Path: the local file path to the JSON key file
  • Project Id: the project that your service account belongs to
  • Scopes: it is recommended to use https://www.googleapis.com/auth/cloud-platform
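
Once the connection exists, DAG tasks reference it by its Conn Id through the gcp_conn_id parameter of the Google provider operators. A minimal sketch, assuming the connection is named google_cloud_default and using placeholder bucket, dataset, and table names:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="load_passengers_to_bq",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    load_passengers = GCSToBigQueryOperator(
        task_id="load_passengers",
        gcp_conn_id="google_cloud_default",  # the Conn Id configured above
        bucket="processing-zone-bucket",  # hypothetical bucket
        source_objects=["passengers.parquet/*.parquet"],
        source_format="PARQUET",
        destination_project_dataset_table="titanic.passengers",  # hypothetical dataset.table
        write_disposition="WRITE_TRUNCATE",
    )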