
Data Pipelines with Airflow

Project Introduction

A music streaming company, Sparkify, has decided that it is time to introduce more automation and monitoring to its data warehouse ETL pipelines and has concluded that the best tool to achieve this is Apache Airflow.

They have decided to bring you into the project and expect you to create high-grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. Because data quality plays a big part in the analyses run on top of the data warehouse, they also want tests executed against the datasets after each ETL run to catch any discrepancies.
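As an illustration of what such a test can look like, the sketch below counts the rows of a table and fails the task if the result is empty. It is a minimal example, not this project's actual operator; the table name is a placeholder, and the connection id matches the one created in the setup section below.

from airflow.providers.postgres.hooks.postgres import PostgresHook

def check_table_has_rows(table: str, redshift_conn_id: str = "redshift_conn_id") -> None:
    """Fail the task if the given Redshift table is empty."""
    hook = PostgresHook(postgres_conn_id=redshift_conn_id)
    records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
    if not records or not records[0] or records[0][0] < 1:
        raise ValueError(f"Data quality check failed: {table} returned no rows")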

The source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that describe user activity in the application and JSON metadata about the songs the users listen to.

Prerequisites

  • AWS user credentials (access key and secret access key).
  • A Redshift cluster running, with S3 access permissions and port 5439 open.
  • Docker and docker-compose.
  • Apache Airflow.

Services

The table below lists the services used and the function of each one.

Service      | Description
AWS IAM      | Creates access credentials and IAM roles that allow services to communicate with each other.
AWS S3       | Stores the log and song source data.
AWS Redshift | Hosts the staging area and the data warehouse.

Setup Environment

To run Airflow locally you need Docker and docker-compose installed; the setup follows the official Running Airflow in Docker guide.

Run the following commands in the terminal.

Clone git repository

git clone https://github.com/gabrielpedrosati/sparkify-data-pipeline-airflow.git

Start Airflow

docker-compose up

In the browser, go to http://localhost:8080 and log in with airflow as both the username and the password.

In the Airflow UI (Admin > Connections), create a connection for Redshift access with the id "redshift_conn_id" and one for AWS access with the id "aws_credentials_id". They can also be created with a short script, as shown below.
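If you prefer to script the connections instead of clicking through the UI, they can be inserted into the Airflow metadata database from inside the Airflow environment (for example, inside one of the running containers). This is a minimal sketch; the host, schema, and credentials are placeholders you must replace with your own values.

from airflow import settings
from airflow.models import Connection

session = settings.Session()
# Redshift connection - replace host, schema, login and password with your cluster's values
redshift = Connection(
    conn_id="redshift_conn_id",
    conn_type="postgres",
    host="<cluster-endpoint>.redshift.amazonaws.com",
    schema="dev",
    login="<db-user>",
    password="<db-password>",
    port=5439,
)
# AWS credentials - access key id as login, secret access key as password
aws = Connection(
    conn_id="aws_credentials_id",
    conn_type="aws",
    login="<aws-access-key-id>",
    password="<aws-secret-access-key>",
)
session.add(redshift)
session.add(aws)
session.commit()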

Then unpause the DAG and wait for the next hourly run, or trigger it manually.

Data

Log Data

The log data consists of JSON files generated by activity logs from a music streaming app.

{
  "artist": null,
  "auth": "Logged In",
  "firstName": "Walter",
  "gender": "M",
  "itemInSession": 0,
  "lastName": "Frye",
  "length": null,
  "level": "free",
  "location": "San Francisco-Oakland-Hayward, CA",
  "method": "GET",
  "page": "Home",
  "registration": 1540919166796,
  "sessionId": 38,
  "song": null,
  "status": 200,
  "ts": 1541105830796,
  "userAgent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"",
  "userId": "39"
}
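The ts field is a Unix timestamp in milliseconds; downstream processing typically converts it to a proper timestamp before it is used (for example, to build a time dimension). A small illustrative snippet, not part of the pipeline code:

import json
from datetime import datetime, timezone

# The log files contain one JSON record per line
line = '{"ts": 1541105830796, "userId": "39", "page": "Home"}'
record = json.loads(line)
# "ts" is Unix epoch time in milliseconds; convert it to a UTC timestamp
start_time = datetime.fromtimestamp(record["ts"] / 1000, tz=timezone.utc)
print(start_time)  # -> 2018-11-01 20:57:10.796000+00:00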

Song Data

Each file is in JSON format and contains metadata about a song and the artist of that song.

{
  "artist_id": "ARJNIUY12298900C91",
  "artist_latitude": null,
  "artist_location": "",
  "artist_longitude": null,
  "artist_name": "Adelitas Way",
  "duration": 213.9424,
  "num_songs": 1,
  "song_id": "SOBLFFE12AF72AA5BA",
  "title": "Scream",
  "year": 2009
}

Data Pipeline

The image below represents all the stages of the project as well as the dependencies between the tasks.

"DAG"