Data Challenge Documentation
The project is an ELTL pipeline, orchestrated with Apache Airflow running in Docker containers.
The challenge is to consume data from an API and persist it into a three-layer data lake: the first layer stores the raw data, the second stores curated data partitioned by location, and the third stores aggregated analytical data.
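In practice this maps to one Airflow DAG with one task per layer. Below is a minimal sketch of such a DAG (TaskFlow syntax from a recent Airflow 2 release); the task names, the API endpoint and the S3 keys are illustrative assumptions, not the project's actual code:

```python
# Illustrative sketch only: task names, the API endpoint and the S3 keys are
# assumptions and do not reflect the project's actual code.
import pendulum
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def brewery_datalake():
    @task
    def extract_raw() -> str:
        """Fetch the breweries from the API and land them in the raw layer."""
        import requests

        breweries = requests.get(
            "https://api.openbrewerydb.org/v1/breweries", timeout=30
        ).json()
        print(f"fetched {len(breweries)} breweries")
        # The real task would write this payload to S3, e.g. s3://<bucket>/raw/
        return "raw/breweries.json"

    @task
    def curate(raw_key: str) -> str:
        """Clean the raw data and rewrite it partitioned by location (second layer)."""
        return "curated/breweries/"

    @task
    def aggregate(curated_prefix: str) -> str:
        """Build aggregated analytical views (third layer)."""
        return "aggregated/breweries_per_type_and_location/"

    aggregate(curate(extract_raw()))


brewery_datalake()
```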
Clone the project to your desired location:
$ git clone https://github.com/rgualter/open-brewery-db.git
Execute the following command to create the .env file containing the Airflow UID needed by docker-compose:
$ echo -e "AIRFLOW_UID=$(id -u)" > .env
Build the Docker images:
$ docker-compose build
Initialize the Airflow database:
$ docker-compose up airflow-init
Start the containers:
$ docker-compose up -d
Once everything is up, you can check that all the containers are running:
$ docker ps
Now you can access the Airflow web interface at http://localhost:8080 with the default credentials defined in docker-compose.yml (username: airflow, password: airflow).
With your AWS S3 user and bucket created, store your AWS credentials as a connection in Airflow. You can also store a connection with the host and port on which Spark is exposed, so the DAG can submit jobs to it:
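One way to wire those connections into the pipeline is to reference their IDs from the provider hooks and operators. The sketch below is an assumption of how that could look; the connection IDs, bucket name, application path and package version are placeholders:

```python
# Sketch assuming an "aws_default" and a "spark_default" connection exist in Airflow.
# Bucket name, keys, application path and package version are placeholders.
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator


def upload_raw_payload(payload: str) -> None:
    """Write the raw API payload to the first layer using the stored AWS credentials."""
    S3Hook(aws_conn_id="aws_default").load_string(
        string_data=payload,
        key="raw/breweries.json",
        bucket_name="your-bucket-name",
        replace=True,
    )


with DAG(
    dag_id="connections_example",
    start_date=pendulum.datetime(2023, 1, 1),
    catchup=False,
):
    upload_raw = PythonOperator(
        task_id="upload_raw",
        python_callable=upload_raw_payload,
        op_kwargs={"payload": "[]"},  # placeholder payload
    )

    # The Spark master host/port stored in the "spark_default" connection is used here.
    curate_breweries = SparkSubmitOperator(
        task_id="curate_breweries",
        application="/usr/local/spark/app/curate_breweries.py",
        conn_id="spark_default",
        packages="org.apache.hadoop:hadoop-aws:3.3.4",  # required for s3a:// access
    )

    upload_raw >> curate_breweries
```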
Now you can trigger the DAG and follow its runs in the Airflow UI.
And finally, check the S3 bucket to confirm that all three layers have been written.
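If you prefer to verify from code rather than the AWS console, a small boto3 check along these lines can sample a few objects per layer (the bucket name and layer prefixes are assumptions; adjust them to your setup):

```python
# Quick sanity check that all three layers have objects in the bucket.
# Bucket name and prefixes are assumptions; change them to match your setup.
import boto3

BUCKET = "your-bucket-name"
LAYERS = ["raw/", "curated/", "aggregated/"]

s3 = boto3.client("s3")
for prefix in LAYERS:
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix, MaxKeys=5)
    found = resp.get("KeyCount", 0)
    print(f"{prefix:<12} {'ok' if found else 'MISSING'} ({found} object(s) sampled)")
```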
If you need to make changes or shut down:
$ docker-compose down
- Apache Airflow Documentation
- Spark by Examples
- Spark S3 integration
- Airflow and Spark with Docker
- Working with data files from S3 in your local PySpark environment
This project is licensed under the terms of the MIT license.