This project builds a data pipeline for loading the Chicago taxi trips dataset into BigQuery for subsequent analysis.
Project Steps:
- Terraform: Create a bucket in GCP and a dataset in BigQuery.
- Airflow: Pipeline for loading data into the bucket and then creating an external table in BigQuery.
- dbt: Create models for use in subsequent analysis.
Terraform docs
You need to:
- configure a service account in GCP
- install Google Cloud SDK
- authenticate in GCP
- install Terraform
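The exact commands depend on your environment, but the authentication and verification steps above typically look something like the following sketch (the project id is a placeholder):
gcloud auth login                        # authenticate the Google Cloud SDK
gcloud config set project <your-project-id>
gcloud auth application-default login    # credentials Terraform's Google provider can pick up
terraform -version                       # confirm Terraform is installed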
To create the infrastructure, run the following script:
bash run_terraform.sh
Check the planned changes and confirm with yes.
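The script presumably wraps the standard Terraform workflow; a rough sketch of what it likely runs (the actual script may differ):
terraform init    # download the Google provider and initialize state
terraform plan    # preview the bucket and dataset to be created
terraform apply   # prompts for confirmation; answer yes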
To destroy the created infrastructure, run:
bash destroy_terraform.sh
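Under the hood this presumably runs terraform destroy (a sketch; the actual script may differ):
terraform destroy   # prompts for confirmation before deleting the bucket and dataset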
Airflow docs
You need to place the Google credentials in the ~/.google/credentials/ directory on your machine (either local or a VM).
cd ~ && mkdir -p ~/.google/credentials/
mv <path/to/your/service-account-authkey>.json ~/.google/credentials/google_credentials.json
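If you want to sanity-check the key file before starting the containers, one option (assuming the Google Cloud SDK is installed) is to activate it locally:
export GOOGLE_APPLICATION_CREDENTIALS=~/.google/credentials/google_credentials.json
gcloud auth activate-service-account --key-file="$GOOGLE_APPLICATION_CREDENTIALS"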
Before running the container, remember to update:
- the GCP_PROJECT_ID and GCP_GCS_BUCKET variable values in the .env file (an example follows this list)
- the DOWNLOAD_START_DATE, DOWNLOAD_END_DATE, BIGQUERY_DATASET, and TABLE_ID variable values in dag__data_ingestion.py
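For reference, the .env entries might look like this (the values are illustrative placeholders, not real resource names):
GCP_PROJECT_ID=my-gcp-project
GCP_GCS_BUCKET=my-chicago-taxi-bucket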
Execution:
- Run the following command to build an image, initialize Airflow, and start all services (a sketch of what the helper scripts do follows this list):
bash run_airflow.sh
- Log in to the Airflow web UI at localhost:8080 with the default credentials admin/admin and run the DAG named dag__data_ingestion
- To shut down all Airflow services, run:
bash shutdown_airflow.sh
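The helper scripts presumably wrap the standard docker-compose workflow for Airflow; a rough sketch of what they run (an assumption; the actual scripts and service names may differ):
# run_airflow.sh (sketch)
docker compose build             # build the custom Airflow image
docker compose up airflow-init   # initialize the metadata DB and create the admin user
docker compose up -d             # start the webserver, scheduler, and supporting services
# shutdown_airflow.sh (sketch)
docker compose down              # stop and remove the Airflow containers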
dbt docs
Before running the models, please install dbt-core or set up dbt Cloud. For more details, refer to the official documentation.
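If you go the dbt-core route with BigQuery, a common way to install it is via pip (the BigQuery adapter package is dbt-bigquery):
pip install dbt-core dbt-bigquery   # the adapter pulls in a compatible dbt-core
dbt --version                       # verify the installation and adapter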
Commands to run dbt models:
dbt seed
dbt build
Models overview: