A demo project of a "modern" open-source data stack: Airbyte + Dagster + dbt + BigQuery. An overview of this project is posted on my website.
pip install -e ".[dev]"
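This is best done inside a virtual environment, for example:

python3 -m venv .venv          # create an isolated environment
source .venv/bin/activate      # activate it before installing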
In .env_template you will find the environment variables that you need to set. Rename this file to .env.
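For example (copying keeps the template around):

cp .env_template .env    # then edit .env and fill in the values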
This setup uses Google BigQuery as the data warehouse. That means you should have a GCP account, preferably with a separate project to run this in. Fill in your project ID and dataset ID in the .env file. You can create the dataset manually in BigQuery or use the provided Terraform automation (which also creates the Airbyte VM).
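If you go the manual route, the dataset can also be created with the bq CLI. BQ_TARGET_PROJECT_ID is an env var the project uses; the dataset variable name here is hypothetical:

bq --location=US mk --dataset "$BQ_TARGET_PROJECT_ID:$BQ_TARGET_DATASET"    # choose your own location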
Switching to Databricks, Snowflake, or Postgres shouldn't be too difficult. If you want to try it out, you will need to manually set up the Airbyte connectors and confirm that dbt points to the correct schema.
Since Airbyte does not have a working GCS source connector, S3 is used as the source. You will need an S3 bucket along with its access key and secret key. The bucket content is structured as follows:
jaffle_shop/
jaffle_shop_customers.csv
jaffle_shop_orders.csv
stripe_payments.csv
The source files are available here.
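Assuming you have the three CSVs in a local folder, the bucket can be populated with the AWS CLI (the bucket name is a placeholder):

aws s3 cp ./jaffle_shop s3://YOUR_BUCKET/jaffle_shop/ --recursive    # uploads the CSVs under the jaffle_shop/ prefix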
Dagster points at the dbt profile located under dbt_project/config/profiles.yml. By default it authenticates via the gcloud CLI, but you will most likely need to execute
gcloud auth application-default login
to obtain access credentials.
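For reference, a minimal BigQuery profile using gcloud OAuth looks roughly like this. The profile name and the dataset variable are assumptions, not copied from the repo:

gcs_modern_data_stack:          # profile name: an assumption
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth             # uses the gcloud application-default credentials
      project: "{{ env_var('BQ_TARGET_PROJECT_ID') }}"
      dataset: "{{ env_var('BQ_TARGET_DATASET') }}"   # hypothetical variable name
      threads: 4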
Follow the instructions in the Airbyte repo to start Airbyte locally.
Alternatively, you can use the Terraform scripts in the automation/terraform folder to create the resources in GCP (along with the BigQuery dataset). Then run the following command to create an SSH tunnel to the Airbyte server:
# In your workstation terminal
SSH_KEY=~/Downloads/dataline-key-airbyte.pem
ssh -i $SSH_KEY -L 8000:localhost:8000 -N -f ec2-user@$INSTANCE_IP
or using gcloud
gcloud compute ssh --zone=us-central1-a --ssh-key-file=$SSH_KEY --project=$PROJECT_ID $INSTANCE_NAME -- -L 8000:localhost:8000 -N -f
You should now be able to access Airbyte at http://localhost:8000/ using the username and password specified in the .env file.
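A quick way to confirm the server is reachable through the tunnel is Airbyte's health endpoint:

curl http://localhost:8000/api/v1/health    # should return something like {"available":true}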
Run:
python3 -m gcs_modern_data_stack.utils.setup_airbyte
This will seed Airbyte with three source connectors, a BigQuery destination, and the connections between the sources and the destination. After running it, you can inspect the result in the Airbyte UI.
Hint: Make sure the setup IDs made it into the .env file. I ran into an issue where having the .env file open in my IDE prevented the automation from writing to it.
Hint 2: there is also a gcs_modern_data_stack.utils.teardown_airbyte script if you need to redo the automated setup.
Run the Dagster server:
dagster dev
And materialize all the assets. If you've done everything correctly, you should be able to push the test data through and play around!
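By default, dagster dev serves the UI at http://localhost:3000; select all assets there and click Materialize. To sanity-check that the code location loads at all, you can list the assets from the CLI (the module name is an assumption):

dagster asset list -m gcs_modern_data_stack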
Make sure the tunnel is running (Airbyte is deployed on GCP) and that the Airbyte server is up. If the connection drops, stop the Dagster server, run the tunnel command again, and start the Dagster server in the same terminal session.
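To check whether the tunnel is still alive, look at what is listening on the forwarded port:

lsof -i :8000    # an ssh process here means the tunnel is up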
If you encounter an error like:
Parsing Error
Env var required but not provided: 'BQ_TARGET_PROJECT_ID'
make sure you are running your Python commands from the root directory of the project; the env vars are loaded from the .env file located there.
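If you also need the variables in your shell (e.g. for the bq command above), you can export everything from .env:

set -a       # auto-export every variable assigned from here on
source .env
set +a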
If you were unable to or don't want to set up Airbyte locally, here's how to deploy it with Terraform:
cd automation/terraform
Create a terraform.tfvars file with the following content:
project = "your-project-id"
credentials_file = "path/to/your/credentials.json"
Then run:
terraform init
terraform plan
terraform apply
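If the scripts define outputs for the Airbyte instance (an assumption; check the .tf files), terraform output is a convenient way to grab the values the tunnel command above needs:

terraform output    # e.g. the instance IP to use as $INSTANCE_IP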
Remember to destroy the resources when you are done:
terraform destroy