Batch Data Processing with Airflow, Python, and Spark (AWS EMR)

Design

Architecture

DAG Design
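The heart of the DAG is a task that submits a spark-submit step to the EMR cluster (e.g. via Airflow's `EmrAddStepsOperator`) and a sensor that waits for it to complete. A minimal sketch of the step definition is shown below as a plain data structure in the format EMR's AddJobFlowSteps API expects; the bucket and script path are hypothetical placeholders, not values from this repository.

```python
# Hypothetical spark-submit step for EMR's AddJobFlowSteps API.
# The S3 script path is a placeholder; substitute your own.
SPARK_STEPS = [
    {
        "Name": "run_spark_job",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-bucket/scripts/spark_script.py",  # placeholder
            ],
        },
    }
]

def step_names(steps):
    """Return the names of the steps that would be submitted."""
    return [s["Name"] for s in steps]
```

In an Airflow DAG, this list would typically be passed to `EmrAddStepsOperator(steps=SPARK_STEPS, ...)`, followed by an `EmrStepSensor` on the returned step id.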

Data

Download the data below and load the CSV into your MySQL database:

```shell
aws s3 cp s3://start-data-engg/data.zip ./
unzip data.zip
```
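The load step can be sketched in Python. This sketch uses `sqlite3` as a stand-in for MySQL so it runs anywhere; the file name `data.csv` and table name `retail` are assumptions, and for a real MySQL target you would swap the connection for a MySQL client.

```python
import csv
import sqlite3

def load_csv(conn, csv_path, table):
    """Create the table from the CSV header and bulk-insert all rows."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', reader
        )
        conn.commit()

# Example: load_csv(sqlite3.connect("data.db"), "data.csv", "retail")
```

Quoting the identifiers keeps arbitrary CSV header names from breaking the SQL, though a production load would also validate types rather than inserting everything as text.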

Prerequisites

  1. [docker] (with WSL2 as the backend)
  2. [AWS_account] (with the AWS CLI configured)
  3. [MySQL] (with data export enabled)

Credit

This repository adopts techniques from the following blog post: How to submit Spark jobs to EMR cluster from Airflow.