Data Challenge Documentation
The project is an ELTL pipeline, orchestrated with Apache Airflow running in Docker containers.
The challenge is to consume data from an API and persist it into a three-layer data lake: the first layer stores the raw data, the second stores curated data partitioned by location, and the third stores aggregated analytical data.
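In practice this maps to one Airflow DAG with one task per layer. Below is a minimal sketch of such a DAG (TaskFlow syntax from a recent Airflow 2 release); the task names, the API endpoint and the S3 keys are illustrative assumptions, not the project's actual code:

```python
# Illustrative sketch only: task names, the API endpoint and the S3 keys are
# assumptions and do not reflect the project's actual code.
import pendulum
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def brewery_datalake():
    @task
    def extract_raw() -> str:
        """Fetch the breweries from the API and land them in the raw layer."""
        import requests

        breweries = requests.get(
            "https://api.openbrewerydb.org/v1/breweries", timeout=30
        ).json()
        print(f"fetched {len(breweries)} breweries")
        # The real task would write this payload to S3, e.g. s3://<bucket>/raw/
        return "raw/breweries.json"

    @task
    def curate(raw_key: str) -> str:
        """Clean the raw data and rewrite it partitioned by location (second layer)."""
        return "curated/breweries/"

    @task
    def aggregate(curated_prefix: str) -> str:
        """Build aggregated analytical views (third layer)."""
        return "aggregated/breweries_per_type_and_location/"

    aggregate(curate(extract_raw()))


brewery_datalake()
```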
Clone the project to your desired location:
$ git clone https://github.com/rgualter/open-brewery-db.git
Execute the following command to create the .env file containing the Airflow UID needed by docker-compose:
$ echo -e "AIRFLOW_UID=$(id -u)" > .env
Build the Docker images:
$ docker-compose build
Initialize the Airflow database:
$ docker-compose up airflow-init
Start the containers:
$ docker-compose up -d
Once everything is up, you can check that all the containers are running:
$ docker ps
Now you can access the Airflow web interface at http://localhost:8080 with the default credentials defined in docker-compose.yml (username: airflow, password: airflow).
With your AWS S3 user and bucket created, store your AWS credentials as a connection in Airflow. You can also store a connection with the host and port on which Spark is exposed, so the DAG can submit jobs to it:
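One way to wire those connections into the pipeline is to reference their IDs from the provider hooks and operators. The sketch below is an assumption of how that could look; the connection IDs, bucket name, application path and package version are placeholders:

```python
# Sketch assuming an "aws_default" and a "spark_default" connection exist in Airflow.
# Bucket name, keys, application path and package version are placeholders.
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def upload_raw_payload(payload: str) -> None:
    """Write the raw API payload to the first layer using the stored AWS credentials."""
    S3Hook(aws_conn_id="aws_default").load_string(
        string_data=payload,
        key="raw/breweries.json",
        bucket_name="your-bucket-name",
        replace=True,
    )


with DAG(
    dag_id="connections_example",
    start_date=pendulum.datetime(2023, 1, 1),
    catchup=False,
):
    upload_raw = PythonOperator(
        task_id="upload_raw",
        python_callable=upload_raw_payload,
        op_kwargs={"payload": "[]"},  # placeholder payload
    )

    # The Spark master host/port stored in the "spark_default" connection is used here.
    curate_breweries = SparkSubmitOperator(
        task_id="curate_breweries",
        application="/usr/local/spark/app/curate_breweries.py",
        conn_id="spark_default",
        packages="org.apache.hadoop:hadoop-aws:3.3.4",  # required for s3a:// access
    )

    upload_raw >> curate_breweries
```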
Now you can trigger the DAG and follow its runs in the Airflow UI.
And finally, check the S3 bucket to confirm that all three layers have been written.
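If you prefer to verify from code rather than the AWS console, a small boto3 check along these lines can sample a few objects per layer (the bucket name and layer prefixes are assumptions; adjust them to your setup):

```python
# Quick sanity check that all three layers have objects in the bucket.
# Bucket name and prefixes are assumptions; change them to match your setup.
import boto3

BUCKET = "your-bucket-name"
LAYERS = ["raw/", "curated/", "aggregated/"]

s3 = boto3.client("s3")
for prefix in LAYERS:
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix, MaxKeys=5)
    found = resp.get("KeyCount", 0)
    print(f"{prefix:<12} {'ok' if found else 'MISSING'} ({found} object(s) sampled)")
```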
If you need to make changes or shut down:
$ docker-compose down
- Apache Airflow Documentation
- Spark by Examples
- Spark S3 integration
- Airflow and Spark with Docker
- Working with data files from S3 in your local PySpark environment
This project is licensed under the terms of the MIT license.