In this repository, I demonstrate some of what I have learned about using Apache Airflow to create, schedule and run data pipelines, as well as how to integrate it with other big data technologies such as BigQuery.
There are two options to run the pipelines:
- Clone the repo and install the dependencies.
- Follow the step-by-step setup instructions below to start from scratch.
Prerequisites:
- docker-compose version 2.x.x or later
- 4 GB to 8 GB of memory allocated to Docker
- Python version 3.8+
- On Linux, the quick-start needs to know your host user id and needs the group id set to 0; otherwise the files created in `dags`, `logs` and `plugins` will be owned by the root user.
- Create the directories and set the Airflow UID:

  ```bash
  mkdir -p ./dags ./logs ./plugins
  echo -e "AIRFLOW_UID=$(id -u)" > .env
  ```
- Download the Docker Compose setup file from Airflow's website:

  ```bash
  curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'
  ```
- Create a `Dockerfile` that will be used to build the Airflow container image (a sketch is shown below).
- Create a GCP project.
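  The project can be created in the GCP console; if you prefer the gcloud CLI (installed in a later step), a sketch using the project id that appears later in this README would be:

  ```bash
  # Project IDs are globally unique -- "alt-data-engr" is the id used later in this setup
  gcloud projects create alt-data-engr
  gcloud config set project alt-data-engr
  ```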
- Create a service account and grant it the Editor, Storage Admin, Storage Object Admin and BigQuery Admin roles.
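  This can be done from the IAM & Admin page; a hedged CLI equivalent (the service account name `airflow-sa` is only an example) would be roughly:

  ```bash
  # Create the service account
  gcloud iam service-accounts create airflow-sa --display-name="airflow-sa"

  # Grant the roles listed above
  for ROLE in roles/editor roles/storage.admin roles/storage.objectAdmin roles/bigquery.admin; do
    gcloud projects add-iam-policy-binding alt-data-engr \
      --member="serviceAccount:airflow-sa@alt-data-engr.iam.gserviceaccount.com" \
      --role="$ROLE"
  done
  ```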
- Create a credential key for the service account and download it.
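  The key can be downloaded from the console as JSON; the CLI equivalent (again assuming the example service account name, and a download path of your choosing) is:

  ```bash
  gcloud iam service-accounts keys create ~/Downloads/service-account-authkeys.json \
    --iam-account=airflow-sa@alt-data-engr.iam.gserviceaccount.com
  ```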
- Rename the key file and move it to the expected location:

  ```bash
  cd ~ && mkdir -p ~/.google/credentials/
  mv <path/to/your/service-account-authkeys>.json ~/.google/credentials/google_credentials.json
  ```
- Install gcloud on your system: open a new terminal and install the gcloud SDK following https://cloud.google.com/sdk/docs/install-sdk. Run `gcloud -v` to check that it installed successfully.
- Set the GOOGLE_APPLICATION_CREDENTIALS environment variable:

  ```bash
  export GOOGLE_APPLICATION_CREDENTIALS="/Users/path/.google/credentials/google_credentials.json"
  ```
- Run `gcloud auth application-default login`.
- You will be redirected to a browser window to authenticate; this links your local environment with the cloud environment.
- Enable the Identity and Access Management (IAM) API.
- Enable the IAM Service Account Credentials API (a CLI sketch follows).
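  Both APIs can be enabled from the console, or with the gcloud CLI:

  ```bash
  gcloud services enable iam.googleapis.com
  gcloud services enable iamcredentials.googleapis.com
  ```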
- Add the Google credentials path ("GOOGLE_APPLICATION_CREDENTIALS"), the default GCP connection, the project id and the bucket name to the environment section of the Docker Compose file:

  ```yaml
  GOOGLE_APPLICATION_CREDENTIALS: /.google/credentials/google_credentials.json
  AIRFLOW_CONN_GOOGLE_CLOUD_DEFAULT: 'google-cloud-platform://?extra__google_cloud_platform__key_path=/.google/credentials/google_credentials.json'
  GCP_PROJECT_ID: "alt-data-engr"
  GCP_GCS_BUCKET: "dte-engr-alt"
  ```
- Add the line below to the `volumes` section of the Docker Compose file from the Airflow documentation (see the placement sketch after this step):

  ```yaml
  ~/.google/credentials/:/.google/credentials:ro
  ```
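  For orientation, the additions from the last two steps sit inside the `x-airflow-common` block of the official `docker-compose.yaml`; an abbreviated sketch (everything else omitted, paths approximate) looks like:

  ```yaml
  x-airflow-common:
    &airflow-common
    build: .                      # build the custom Dockerfile instead of pulling the stock image
    environment:
      &airflow-common-env
      # ...existing Airflow variables...
      # the four GOOGLE/GCP entries from the previous step go here
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
      - ~/.google/credentials/:/.google/credentials:ro
  ```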
- Create a `requirements.txt` file inside the airflow folder and add the dependencies as in this repo (an illustrative example follows).
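  The authoritative list is the `requirements.txt` in this repo; purely as an illustration, a GCP-focused Airflow setup typically pulls in something like:

  ```text
  apache-airflow-providers-google
  pandas
  pyarrow
  ```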
- Create a `scripts` folder inside the airflow folder and, inside it, an `entrypoint.sh` file (paste its contents from this repo). Now use it to build the Airflow container image.
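  The real script should be copied from this repo; as a generic, hypothetical illustration only, such a wrapper usually does a little start-up work and then hands off to the image's stock entrypoint:

  ```bash
  #!/usr/bin/env bash
  # Hypothetical sketch -- replace with the entrypoint.sh from this repo
  set -e

  # (any start-up work, e.g. exporting variables, would go here)

  # hand control back to the official Airflow image entrypoint
  exec /entrypoint "$@"
  ```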
- Run `docker-compose build`.
- Run `docker-compose up airflow-init`.
- Run `docker-compose up`.
- Inside the `dags/plugins` folder, create a `web` folder.
- Inside the `web` folder, create an `operators` folder and copy over the files from this repo (see the sketch below).
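  A hedged sketch of those last two steps (the clone location and file layout are assumptions; mirror whatever this repo actually contains):

  ```bash
  # create the folder structure inside your local Airflow setup
  mkdir -p plugins/web/operators

  # copy the operator files over from a local clone of this repo (path is hypothetical)
  cp /path/to/this-repo/web/operators/*.py plugins/web/operators/
  ```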
Airflow setup: https://airflow.apache.org/docs/apache-airflow/stable/start.html
Docker setup instructions: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html
To restart the project in cases where the Docker containers were stopped:

```bash
docker compose up airflow-init
docker compose up
```
By this point you should have enough to run the DAG files by simply copying them over to your local setup. Have fun coding...
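For reference, copying the DAGs over and confirming Airflow has picked them up might look like this (the repo path and the `airflow-webserver` service name are assumptions based on the official Compose file):

```bash
# copy the DAG files into the folder mounted by docker-compose
cp /path/to/this-repo/dags/*.py ./dags/

# confirm Airflow has registered them
docker compose exec airflow-webserver airflow dags list
```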