Data Pipelines with Airflow

Scope

The scope of the project is to extract data from S3 buckets containing logs about songs and webpage usage and move the data into a Redshift database, first into staging tables and from there into dimensional and fact tables, using Apache Airflow as the orchestrator.

Execution

This is a simple visualization of the project (project_airflow diagram), showing the flow of the data from the S3 buckets to the different tables in the Redshift database.

The DAG used in Airflow, with all the tasks needed to achieve the process, is shown in the dag diagram.
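A minimal sketch of how such a DAG could be wired together is shown below. It only illustrates the dependency structure: the DAG id, schedule and DummyOperator placeholders are assumptions, and the project itself uses custom operators for staging, loading and data-quality checks (imports follow Airflow 1.10-style paths).

```python
# Sketch of the DAG structure, assuming Airflow 1.10-style imports and
# placeholder DummyOperator tasks; the real project replaces these with
# custom staging, loading and data-quality operators.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    "sparkify_etl",                     # hypothetical DAG id
    start_date=datetime(2018, 11, 1),   # assumed start date
    schedule_interval="@hourly",        # assumed schedule
    catchup=False,
) as dag:
    start = DummyOperator(task_id="Begin_execution")
    stage_events = DummyOperator(task_id="Stage_events")   # S3 -> s_events
    stage_songs = DummyOperator(task_id="Stage_songs")     # S3 -> s_songs
    load_songplays = DummyOperator(task_id="Load_songplays_fact_table")
    load_users = DummyOperator(task_id="Load_users_dim_table")
    load_songs = DummyOperator(task_id="Load_songs_dim_table")
    load_artists = DummyOperator(task_id="Load_artists_dim_table")
    load_time = DummyOperator(task_id="Load_time_dim_table")
    quality_checks = DummyOperator(task_id="Run_data_quality_checks")
    end = DummyOperator(task_id="Stop_execution")

    # Staging runs in parallel, then the fact table, then the dimensions,
    # and finally the data-quality checks.
    start >> [stage_events, stage_songs] >> load_songplays
    load_songplays >> [load_users, load_songs, load_artists, load_time] >> quality_checks
    quality_checks >> end
```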

Input data

The input data consists of two S3 buckets holding two different datasets.

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from an imaginary music streaming app, based on configuration settings.

The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

The log-data image shows an example of what the data in a log file, 2018-11-12-events.json, looks like.
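A log file can be inspected the same way. Assuming each file is newline-delimited JSON (one event per line), a sketch like the following prints a few events from a hypothetical local copy.

```python
import json

# Hypothetical local copy of one log file from the log_data bucket
path = "log_data/2018/11/2018-11-12-events.json"

# Assumption: each line is one JSON event (newline-delimited JSON)
with open(path) as f:
    events = [json.loads(line) for line in f if line.strip()]

for event in events[:3]:
    # Fields such as ts, userId, page and song later feed the fact and dimension tables
    print(event.get("ts"), event.get("userId"), event.get("page"), event.get("song"))
```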

Redshift database

The Redshift database contains two staging tables, s_events and s_songs, which hold all the data extracted from the datasets above. From these, four dimensional tables (users, songs, artists and time) and a fact table (songplays) are populated. The ER schema of these tables is shown in the schema diagram.
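As an illustration of the staging-to-fact step, a hedged sketch of the songplays insert is shown below. The column names of s_events and s_songs are assumptions, so the actual query in the project may differ.

```python
# Sketch of the staging -> fact load; column names are assumptions and may
# differ from the actual s_events / s_songs definitions in the project.
SONGPLAYS_INSERT = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT
        TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
        e.userid,
        e.level,
        s.song_id,
        s.artist_id,
        e.sessionid,
        e.location,
        e.useragent
    FROM s_events e
    JOIN s_songs  s
      ON e.song = s.title
     AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""
```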

Launch the project

To execute the project, a working Apache Airflow installation is needed, with valid AWS and Redshift credentials configured as the following two Airflow connections:

Conn Id: aws_credentials
Conn Type: Amazon Web Services
Login: Access key ID of a valid IAM user
Password: Secret access key of the same IAM user

Conn Id: redshift
Conn Type: Postgres
Host: The endpoint of your Redshift cluster, excluding the port at the end
Schema: The Redshift database you want to connect to
Login: A user that has access to the database
Password: Password for the database
Port: Port of the database (default 5439)
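Once both connections exist, tasks can retrieve them by their Conn Id through Airflow hooks. The snippet below is a minimal sketch assuming Airflow 1.10-style import paths (in Airflow 2.x the equivalent hooks live in provider packages), and the function name and SQL are placeholders.

```python
# Minimal sketch of how the two connections are consumed inside a task,
# assuming Airflow 1.10-style import paths.
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook


def copy_staging_table(*args, **kwargs):
    # AWS credentials stored under Conn Id 'aws_credentials'
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()  # access_key / secret_key pair

    # Redshift connection stored under Conn Id 'redshift'
    redshift = PostgresHook(postgres_conn_id="redshift")
    redshift.run("SELECT 1;")  # placeholder: a COPY ... FROM 's3://...' statement goes here
```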