A music streaming company, Sparkify, has decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines and come to the conclusion that the best tool to achieve this is Apache Airflow.
They have decided to bring you into the project and expect you to create high grade data pipelines that are dynamic and built from reusable tasks, can be monitored, and allow easy backfills. They have also noted that the data quality plays a big part when analyses are executed on top the data warehouse and want to run tests against their datasets after the ETL steps have been executed to catch any discrepancies in the datasets.
The source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that tell about user activity in the application and JSON metadata about the songs the users listen to.
- AWS user credentials (access key and secret access key).
- Redshift Cluster running with S3 access permissions and port 5439 open.
- Docker and docker-compose.
- Apache Airflow.
The table below lists the services used and the function of each one.
Services | Description |
---|---|
AWS IAM | Creates access credentials and roles for services to communicate with other services. |
AWS S3 | Stores log and song data. |
AWS Redshift | Creates the staging area and Data Warehouse. |
To run locally you need to have docker-compose installed - Running Airflow in Docker.
Run the following commands in the terminal.
Clone git repository
git clone https://github.com/gabrielpedrosati/sparkify-data-pipeline-airflow.git
Start Airflow
docker-compose up
In the browser go to localhost:8080. Type airflow
for username and password.
Create connections for Redshift access with name "redshift_conn_id" and for AWS access with id "aws_credentials_id". Redshift Connection and AWS Connection.
Then just start the DAG and wait until the next hour or you can use the manual trigger function.
Consists of files in JSON format generated by app activity logs from an music streaming app.
{
"artist": null,
"auth": "Logged In",
"firstName": "Walter",
"gender": "M",
"itemInSession": 0,
"lastName": "Frye",
"length": null,
"level": "free",
"location": "San Francisco-Oakland-Hayward, CA",
"method": "GET",
"page": "Home",
"registration": 1540919166796,
"sessionId": 38,
"song": null,
"status": 200,
"ts": 1541105830796,
"userAgent": "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"",
"userId": "39"
}
Each file is in JSON format and contains metadata about a song and the artist of that song.
{
"artist_id": "ARJNIUY12298900C91",
"artist_latitude": null,
"artist_location": "",
"artist_longitude": null,
"artist_name": "Adelitas Way",
"duration": 213.9424,
"num_songs": 1,
"song_id": "SOBLFFE12AF72AA5BA",
"title": "Scream",
"year": 2009
}
The image below represents all the stages of the project as well as the dependencies between the tasks.