This project processes semi-structured data and builds a data lake that provides efficient storage and read performance. The data lake is organized in the following two layers:
- raw layer: datasets must have the same schema as the source, but support fast structured data reading
- trusted layer: Datamarts as required by the analysis team
The Datamarts required in the trusted layer should be built according to the following rules:
- Order dataset: one line per order with all data from order, consumer, restaurant and the LAST status from the order statuses dataset. To help analysis, it would be nice to have the data partitioned by the restaurant LOCAL date.
- Order Items dataset: easy to read dataset with one-to-many relationship with Order dataset. Must contain all data from order items column.
- Order statuses: Dataset containing one line per order with the timestamp for each registered event: CONCLUDED, REGISTERED, CANCELLED, PLACED.
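The rules above can be sketched in plain Python to make the expected shapes concrete. This is only an illustration under assumptions, not the project's PySpark code: the tuple layout `(order_id, status, timestamp)`, the helper names `pivot_statuses` and `local_partition_date`, and the fixed UTC-3 offset are all hypothetical. It shows one record per order with a timestamp per event, the LAST status derived from those timestamps, and a partition date computed in a restaurant's local time.

```python
from datetime import datetime, timedelta, timezone

EVENTS = ("REGISTERED", "PLACED", "CONCLUDED", "CANCELLED")


def pivot_statuses(events):
    """Collapse (order_id, status, timestamp) events into one record per order."""
    orders = {}
    for order_id, status, ts in events:
        row = orders.setdefault(order_id, dict.fromkeys(EVENTS))
        # If the same event was registered twice, keep the latest timestamp.
        if row.get(status) is None or ts > row[status]:
            row[status] = ts
    for row in orders.values():
        seen = {s: t for s, t in row.items() if t is not None}
        # The LAST status is the event with the greatest timestamp.
        row["last_status"] = max(seen, key=seen.get)
    return orders


def local_partition_date(ts_utc, tz):
    """Return the ISO date of a UTC timestamp as seen in the given timezone."""
    return ts_utc.astimezone(tz).date().isoformat()


utc = timezone.utc
events = [
    ("o1", "REGISTERED", datetime(2019, 1, 1, 1, 0, tzinfo=utc)),
    ("o1", "PLACED", datetime(2019, 1, 1, 1, 1, tzinfo=utc)),
    ("o1", "CONCLUDED", datetime(2019, 1, 1, 2, 0, tzinfo=utc)),
    ("o2", "REGISTERED", datetime(2019, 1, 2, 15, 0, tzinfo=utc)),
    ("o2", "CANCELLED", datetime(2019, 1, 2, 15, 5, tzinfo=utc)),
]
statuses = pivot_statuses(events)

# Assumption: a fixed UTC-3 offset stands in for the restaurant timezone.
restaurant_tz = timezone(timedelta(hours=-3))
partition = local_partition_date(datetime(2019, 1, 1, 1, 0, tzinfo=utc), restaurant_tz)
```

An order registered at 01:00 UTC on 2019-01-01 falls on 2018-12-31 in the restaurant's local time, which is why partitioning on the local date differs from partitioning on the UTC date.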
For the trusted layer, anonymize any sensitive data.
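One common way to anonymize while keeping datasets joinable is a keyed hash, sketched below in plain Python. The field names in `SENSITIVE_FIELDS`, the `anonymize` helper and the secret value are hypothetical and are not taken from the project; which columns count as sensitive is a decision for whoever owns the data.

```python
import hashlib
import hmac

# Hypothetical column names; adjust to the columns considered sensitive.
SENSITIVE_FIELDS = ("cpf", "customer_name", "customer_phone_number")


def anonymize(record, secret, fields=SENSITIVE_FIELDS):
    """Return a copy of record with sensitive fields replaced by an HMAC-SHA256 digest."""
    out = dict(record)
    for field in fields:
        value = out.get(field)
        if value is not None:
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            out[field] = digest.hexdigest()
    return out


order = {"order_id": "o1", "cpf": "12345678900", "customer_name": None}
masked = anonymize(order, secret=b"replace-with-a-real-secret")
```

Because the digest is deterministic for a given secret, the same consumer still maps to the same value across datasets, while the original value cannot be read back from the trusted layer.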
At the end of each ETL, use any appropriate method to validate your data. For read performance, watch out for small files and skewed data.
Non-functional requirements
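As a rough illustration of what checking for skew and small files can mean, the sketch below works on already collected metrics: row counts per partition and file sizes in bytes. The helper names and the 32 MiB threshold are assumptions chosen for the example, not values from the project.

```python
from statistics import mean


def skew_ratio(rows_per_partition):
    """Ratio between the largest partition and the average partition size."""
    counts = list(rows_per_partition.values())
    return max(counts) / mean(counts)


def small_files(file_sizes, min_bytes=32 * 1024 * 1024):
    """Return the paths whose size is below min_bytes (hypothetical threshold)."""
    return sorted(path for path, size in file_sizes.items() if size < min_bytes)


rows = {"2019-01-01": 100, "2019-01-02": 100, "2019-01-03": 400}
ratio = skew_ratio(rows)

sizes = {"part-0000.parquet": 512, "part-0001.parquet": 64 * 1024 * 1024}
too_small = small_files(sizes)
```

A ratio close to 1 suggests evenly sized partitions; a ratio well above 1, or a long list of small files, suggests the data should be repartitioned or compacted before it is written.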
- Data volume increases each day.
- All ETLs must be built to be scalable.
First, install the following software to reproduce this project:
Then initialize the Kubernetes environment with Airflow and Spark using the following commands:
make create-cluster
make create-namespace
make add-charts
make helm-init
To forward the Airflow Web UI to your browser, open another terminal and run:
make airflow-forward
To forward the Spark Web UI to your browser, open another terminal and run:
make spark-forward
To release the Airflow or Spark images of this project, run the release targets with a VERSION as parameter:
make airflow-release VERSION=0.0.1
make spark-release VERSION=0.0.1
To update the code files in the Kubernetes environment, run the following:
make update
To wrap all these commands and start your first version, just run:
make all
To clear your environment and remove the Kubernetes cluster, run the following:
make clear
The AWS credentials are passed to the charts through environment variables. For the Apache Airflow chart, fill in the env section in templates/airflow/airflow-chart.yaml as in the following example:
env:
  - name: AWS_ACCESS_KEY_ID
    value: "MYACCESSKEYID1234"
  - name: AWS_SECRET_ACCESS_KEY
    value: "MY/SECRETACCESSKEY1234"
  - name: AIRFLOW_CONN_SPARK_DEFAULT
    value: "spark://spark%3A%2F%2Fifood-spark-master-svc:7077"
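The value of AIRFLOW_CONN_SPARK_DEFAULT is an Airflow connection URI: the leading `spark://` is the connection type, and the host part is the Spark master URL with its special characters percent-encoded. The sketch below shows how such a value can be assembled with the standard library; the helper name `spark_conn_uri` is hypothetical.

```python
from urllib.parse import quote


def spark_conn_uri(master_host, port):
    """Build an Airflow connection URI whose host is a percent-encoded spark:// URL."""
    # safe="" forces ':' and '/' to be encoded as %3A and %2F.
    encoded_master = quote(f"spark://{master_host}", safe="")
    return f"spark://{encoded_master}:{port}"


uri = spark_conn_uri("ifood-spark-master-svc", 7077)
```

With the service name and port used in the example above, this reproduces the value shown for AIRFLOW_CONN_SPARK_DEFAULT.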
To do the same for the Apache Spark pods, add an extraEnvVars section for the master and the workers in the templates/spark/spark-chart.yaml file, as follows:
master:
  extraEnvVars:
    - name: AWS_ACCESS_KEY_ID
      value: "MYACCESSKEYID1234"
    - name: AWS_SECRET_ACCESS_KEY
      value: "MY/SECRETACCESSKEY1234"
worker:
  extraEnvVars:
    - name: AWS_ACCESS_KEY_ID
      value: "MYACCESSKEYID1234"
    - name: AWS_SECRET_ACCESS_KEY
      value: "MY/SECRETACCESSKEY1234"
Both the Apache Airflow and Apache Spark environments are deployed in Kubernetes, using the official Apache Airflow Helm chart and the Bitnami Apache Spark Helm chart. When running locally with kind, Airflow is deployed with a single worker and Spark with three workers. A shared volume syncs code between the pods and the local folders.
Two DAGs orchestrate the ETL jobs: ifood-ingestion and ifood-datamart. The ifood-ingestion DAG ingests the files from s3://ifood-data-architecture-source into the raw layer, stored in s3://ifood-lake/raw, and makes sure that every row was ingested. The ifood-datamart DAG then reads the raw layer and builds the specified datamarts in s3://ifood-lake/trusted. Both DAGs are scheduled to run yearly by default, although each job can be configured with any schedule. The raw and trusted layers are written in Delta Lake format in upsert mode. After the ETL tasks in ifood-ingestion finish, a further task is submitted to compare the row counts of the ID columns between the source and the raw layer. The file spark/src/schemas.yaml stores the data type of each column in the raw layer. The file airflow/dags/config.yaml provides the paths to the source files and a few other parameters used to sync the data into the raw layer.
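The upsert and the row-count check described above can be illustrated with a small plain-Python sketch. It is not the project's Spark or Delta Lake code: the `upsert` and `validate_ingestion` helpers, the `order_id` key and the sample rows are assumptions made for the example. The point it shows is that, with upserts, the raw layer holds one row per ID, so the check compares distinct IDs rather than raw line counts.

```python
def upsert(target, batch, key):
    """Insert new rows and overwrite existing ones, matching on the key column."""
    for row in batch:
        target[row[key]] = row
    return target


def count_distinct_ids(rows, key):
    """Number of distinct values of the key column."""
    return len({row[key] for row in rows})


def validate_ingestion(source_rows, raw_rows, key):
    """Raise if source and raw layer disagree on the number of distinct IDs."""
    expected = count_distinct_ids(source_rows, key)
    actual = count_distinct_ids(raw_rows, key)
    if expected != actual:
        raise ValueError(f"{key}: expected {expected} ids in raw layer, found {actual}")
    return actual


source = [
    {"order_id": "o1", "amount": 10.0},
    {"order_id": "o2", "amount": 25.5},
    {"order_id": "o1", "amount": 12.0},  # later version of the same order
]
raw = upsert({}, source, key="order_id")
checked = validate_ingestion(source, list(raw.values()), key="order_id")
```

Running the same batch twice leaves the raw layer unchanged, which is the property that makes reruns of the ingestion DAG safe in upsert mode.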
The airflow/ folder holds the Apache Airflow Dockerfile, the DAG files and the other requirements to be deployed in Kubernetes. The spark/ folder follows a similar pattern: it holds the PySpark scripts together with a Dockerfile and their requirements. The templates/ folder stores the Helm chart configs and other pod declarations.