Data Pipeline with Airflow

Motivation

The music streaming company where we work, Sparkify, has decided it is time to introduce more automation and monitoring to their data warehouse ETL pipelines and has concluded that the best tool to achieve this is Apache Airflow.

As their data engineer on this project, I have been tasked with creating high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. Since data quality plays a big part when analyses are executed on top of the data warehouse, the ETL process is expected to include built-in data quality checks, so that any discrepancies can be caught early and corrected.
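
The DAG below is a minimal illustrative sketch of that idea, not the project's actual code: it shows default arguments that enable retries and a fixed start date for backfills, plus one reusable data quality check applied to several tables. The DAG id, connection id, and table names are all placeholders, and Airflow 1.10-style import paths are assumed.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

# Default arguments that make the pipeline monitorable and backfill-friendly:
# failed tasks retry automatically and runs do not depend on past runs.
default_args = {
    "owner": "sparkify",
    "start_date": datetime(2019, 1, 1),
    "depends_on_past": False,
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

def check_table_has_rows(table):
    """Reusable data quality check: fail the task if the table is empty."""
    redshift = PostgresHook(postgres_conn_id="redshift")  # connection id is an assumption
    records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
    if not records or records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} returned no rows")

with DAG(
    "sparkify_etl_sketch",          # hypothetical DAG id
    default_args=default_args,
    schedule_interval="@hourly",
    catchup=False,                  # backfills can still be triggered explicitly
) as dag:
    # The same callable is reused for every table that needs a check.
    for table in ["songplays", "users", "songs", "artists", "time"]:
        PythonOperator(
            task_id=f"check_{table}_has_rows",
            python_callable=check_table_has_rows,
            op_kwargs={"table": table},
        )
```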

The source data resides in S3 and needs to be processed into Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that record user activity in the application and JSON metadata about the songs the users listen to.
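
For illustration, staging these JSON files into Redshift usually comes down to a COPY statement issued from an Airflow task. The sketch below is an example under stated assumptions, not the project's operator: the redshift connection id, the staging table name, and the region are placeholders, and credentials would normally come from an Airflow AWS connection rather than being passed around directly.

```python
from airflow.hooks.postgres_hook import PostgresHook

# Template for Redshift's COPY command; FORMAT AS JSON 'auto' maps JSON keys to columns.
COPY_JSON_SQL = """
    COPY {table}
    FROM '{s3_path}'
    ACCESS_KEY_ID '{access_key}'
    SECRET_ACCESS_KEY '{secret_key}'
    REGION 'us-west-2'
    FORMAT AS JSON '{json_option}';
"""

def stage_to_redshift(table, s3_path, access_key, secret_key, json_option="auto"):
    """Copy JSON files from S3 into a Redshift staging table."""
    redshift = PostgresHook(postgres_conn_id="redshift")  # connection id is an assumption
    redshift.run(COPY_JSON_SQL.format(
        table=table,
        s3_path=s3_path,
        access_key=access_key,
        secret_key=secret_key,
        json_option=json_option,
    ))

# Example: stage the song metadata into a hypothetical staging table.
# stage_to_redshift("staging_songs", "s3://udacity-dend/song_data", ACCESS_KEY, SECRET_KEY)
```

If the JSON key names do not match the target column names, the json_option can point to a JSONPaths file instead of 'auto'.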

Skills/technologies being tested

  • Knowledge of data modeling and ETL
  • Knowledge of SQL (to a limited extent)
  • Knowledge of AWS and Redshift
  • Knowledge of Airflow
  • Writing clean, modular, and well-documented Python code

The Datasets

I'll be working with two datasets that reside in Udacity's S3 buckets:

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data

The song dataset (JSON files) is based on the Million Song Dataset, while the log dataset is generated by an event simulator based on the song data.
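
To get a feel for the raw files before wiring up the pipeline, they can be listed directly with boto3. This is purely exploratory and assumes AWS credentials are already configured locally; the region is an assumption.

```python
import boto3

# List a handful of song-data files from Udacity's bucket.
s3 = boto3.client("s3", region_name="us-west-2")  # region is an assumption
response = s3.list_objects_v2(
    Bucket="udacity-dend",
    Prefix="song_data",
    MaxKeys=10,
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```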

Deployment

I am running an instance of Airflow in a Docker container for this project. The container is based on this Docker image.

  • Pull the Docker image from Docker Hub
  • Start a container off of the image, mounting the volumes containing the DAGs and plugins with the syntax -v $(pwd)/plugins/:/usr/local/airflow/plugins
  • Install boto3 in the container by running docker container exec -it CONTAINER pip install boto3
  • Create a Redshift cluster programmatically or using the AWS console (see the boto3 sketch after this list)
  • Run the create_tables.sql file to create the required tables in Redshift
  • Access the Airflow web UI by connecting to the specified port, typically localhost:8080
  • Configure connections to AWS and your Redshift cluster using connection variables in the Airflow web UI
  • Run the DAG from the Airflow web UI
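
As a rough illustration of the "create a Redshift cluster programmatically" step, the snippet below uses boto3. Every identifier, credential, and sizing value is a placeholder, and the IAM role must already exist with read access to S3.

```python
import boto3

# Create a small multi-node Redshift cluster (all values below are placeholders).
redshift = boto3.client("redshift", region_name="us-west-2")  # region is an assumption

redshift.create_cluster(
    ClusterIdentifier="sparkify-cluster",   # hypothetical name
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="sparkify",
    MasterUsername="awsuser",
    MasterUserPassword="REPLACE_ME",        # never commit real credentials
    IamRoles=["arn:aws:iam::ACCOUNT_ID:role/redshift-s3-read-role"],  # role must exist
    PubliclyAccessible=True,
)

# Wait until the cluster is available, then read its endpoint for the Airflow connection.
waiter = redshift.get_waiter("cluster_available")
waiter.wait(ClusterIdentifier="sparkify-cluster")
endpoint = redshift.describe_clusters(
    ClusterIdentifier="sparkify-cluster"
)["Clusters"][0]["Endpoint"]["Address"]
print(endpoint)
```

The endpoint printed at the end is what goes into the Redshift connection configured in the Airflow web UI.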

References