Data Pipeline with Airflow

Motivation

The music streaming company where we work, Sparkify, has decided it is time to introduce more automation and monitoring to their data warehouse ETL pipelines and has concluded that the best tool to achieve this is Apache Airflow.

As their data engineer on this project, I have been tasked with creating high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. Since data quality plays a big part when analyses are executed on top of the data warehouse, the ETL process is expected to include built-in data quality checks, so that any discrepancies can be caught early and corrected.
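
The DAG below is a minimal illustrative sketch of that idea, not the project's actual code: it shows default arguments that enable retries and a fixed start date for backfills, plus one reusable data quality check applied to several tables. The DAG id, connection id, and table names are all placeholders, and Airflow 1.10-style import paths are assumed.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

# Default arguments that make the pipeline monitorable and backfill-friendly:
# failed tasks retry automatically and runs do not depend on past runs.
default_args = {
    "owner": "sparkify",
    "start_date": datetime(2019, 1, 1),
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

def check_table_has_rows(table):
    """Reusable data quality check: fail the task if the table is empty."""
    redshift = PostgresHook(postgres_conn_id="redshift")  # connection id is an assumption
    records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
    if not records or records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} returned no rows")

with DAG(
    "sparkify_etl_sketch",          # hypothetical DAG id
    default_args=default_args,
    schedule_interval="@hourly",
    catchup=False,                  # backfills can still be triggered explicitly
) as dag:
    # The same callable is reused for every table that needs a check.
    for table in ["songplays", "users", "songs", "artists", "time"]:
        PythonOperator(
            task_id=f"check_{table}_has_rows",
            python_callable=check_table_has_rows,
            op_kwargs={"table": table},
        )
```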

The source data resides in S3 and needs to be processed into Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that record user activity in the application and JSON metadata about the songs the users listen to.
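
For illustration, staging these JSON files into Redshift usually comes down to a COPY statement issued from an Airflow task. The sketch below is an example under stated assumptions, not the project's operator: the redshift connection id, the staging table name, and the region are placeholders, and credentials would normally come from an Airflow AWS connection rather than being passed around directly.

```python
from airflow.hooks.postgres_hook import PostgresHook

# Template for Redshift's COPY command; FORMAT AS JSON 'auto' maps JSON keys to columns.
COPY_JSON_SQL = """
    COPY {table}
    FROM '{s3_path}'
    ACCESS_KEY_ID '{access_key}'
    SECRET_ACCESS_KEY '{secret_key}'
    REGION 'us-west-2'
    FORMAT AS JSON '{json_option}';
"""

def stage_to_redshift(table, s3_path, access_key, secret_key, json_option="auto"):
    """Copy JSON files from S3 into a Redshift staging table."""
    redshift = PostgresHook(postgres_conn_id="redshift")  # connection id is an assumption
    redshift.run(COPY_JSON_SQL.format(
        table=table,
        s3_path=s3_path,
        access_key=access_key,
        secret_key=secret_key,
        json_option=json_option,
    ))

# Example: stage the song metadata into a hypothetical staging table.
# stage_to_redshift("staging_songs", "s3://udacity-dend/song_data", ACCESS_KEY, SECRET_KEY)
```

If the JSON key names do not match the target column names, the json_option can point to a JSONPaths file instead of 'auto'.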

Skills/technologies being tested

  • Knowledge of data modeling and ETL
  • Knowledge of SQL (to a limited extent)
  • Knowledge of AWS and Redshift
  • Knowledge of Airflow
  • Writing clean, modular, and well-documented Python code

The Datasets

I'll be working with two datasets that reside in Udacity's S3 buckets:

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data

The song dataset (JSON files) is based on the Million Song Dataset, while the log dataset is generated by an event simulator based on the song data.
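
To get a feel for the raw files before wiring up the pipeline, they can be listed directly with boto3. This is purely exploratory and assumes AWS credentials are already configured locally; the region is an assumption.

```python
import boto3

# List a handful of song-data files from Udacity's bucket.
s3 = boto3.client("s3", region_name="us-west-2")  # region is an assumption
response = s3.list_objects_v2(
    Bucket="udacity-dend",
    Prefix="song_data",
    MaxKeys=10,
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```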

Deployment

I am running an instance of Airflow in a Docker container for this project. The container is based on this Docker image.

  • Pull the Docker image from Docker Hub
  • Start a container off of the image, mounting the volumes containing the DAGs and plugins with the syntax -v $(pwd)/plugins/:/usr/local/airflow/plugins
  • Install boto3 in the container by running docker container exec -it CONTAINER pip install boto3
  • Create a Redshift cluster programmatically or using the AWS console (see the boto3 sketch after this list)
  • Run the create_tables.sql file to create the required tables in Redshift
  • Access the Airflow web UI by connecting to the specified port, typically localhost:8080
  • Configure connections to AWS and your Redshift cluster using connection variables in the Airflow web UI
  • Run the DAG from the Airflow web UI
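
As a rough illustration of the "create a Redshift cluster programmatically" step, the snippet below uses boto3. Every identifier, credential, and sizing value is a placeholder, and the IAM role must already exist with read access to S3.

```python
import boto3

# Create a small multi-node Redshift cluster (all values below are placeholders).
redshift = boto3.client("redshift", region_name="us-west-2")  # region is an assumption

redshift.create_cluster(
    ClusterIdentifier="sparkify-cluster",   # hypothetical name
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="sparkify",
    MasterUsername="awsuser",
    MasterUserPassword="REPLACE_ME",        # never commit real credentials
    IamRoles=["arn:aws:iam::ACCOUNT_ID:role/redshift-s3-read-role"],  # role must exist
    PubliclyAccessible=True,
)

# Wait until the cluster is available, then read its endpoint for the Airflow connection.
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="sparkify-cluster")
endpoint = redshift.describe_clusters(
    ClusterIdentifier="sparkify-cluster"
)["Clusters"][0]["Endpoint"]["Address"]
print(endpoint)
```

The endpoint printed at the end is what goes into the Redshift connection configured in the Airflow web UI.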

References