uber-eats-airflow-spark-glue-athena

Ingest CSV files and load them to S3, upload the Spark script to S3, and run the Spark code on an EMR cluster, which pulls the raw UberEats data from S3, cleans it, and loads it back to S3 in the proper schema. All of this is orchestrated with Airflow.

Architecture

(Architecture diagram)

Tech Stack

  • AWS Glue Data Catalog
  • AWS Glue Crawler
  • AWS EMR
  • AWS EC2
  • Apache Spark
  • Airflow
  • Amazon S3
  • Amazon Athena
  • SQL
  • Python

Project Overview

In this project, we built an entire workflow orchestrated with Airflow. The workflow involves:

  • Uploading the CSV files and the Spark script to S3.
  • Creating an EMR cluster to execute the Spark job, which cleans the data and loads it into another S3 bucket in Avro format with the appropriate data model (see the sketch after this list).
  • Creating a Glue Crawler and a Data Catalog to facilitate querying the resulting data with Amazon Athena.
  • Querying the resulting tables with Amazon Athena.
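
As a rough illustration of the cleaning step, here is a minimal PySpark sketch. The bucket names, file name, and column names are assumptions for illustration, not the actual job:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Writing Avro requires the spark-avro package, e.g.
# spark-submit --packages org.apache.spark:spark-avro_2.12:3.3.0 clean_uber_eats.py
spark = SparkSession.builder.appName("clean-uber-eats").getOrCreate()

# Read the raw CSV data from the landing bucket (hypothetical path).
raw = spark.read.csv("s3://uber-eats-raw/restaurants.csv", header=True, inferSchema=True)

# Basic cleaning: drop duplicates and rows missing key fields,
# and normalize a string column (hypothetical column names).
cleaned = (
    raw.dropDuplicates()
       .dropna(subset=["id", "name"])
       .withColumn("name", F.trim(F.col("name")))
)

# Write the cleaned data back to S3 in Avro format (hypothetical path).
cleaned.write.format("avro").mode("overwrite").save("s3://uber-eats-clean/restaurants/")

spark.stop()
```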

Data Model

(Data model diagram)

Airflow DAG

(Airflow DAG diagram)
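
Since the diagram may not render here, below is a minimal sketch of what such a DAG could look like with Airflow 2.x and the Amazon provider. The DAG ID, bucket names, job-flow settings, IAM roles, and crawler config are assumptions, not the project's actual values:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor
from airflow.providers.amazon.aws.transfers.local_to_s3 import LocalFilesystemToS3Operator

with DAG("uber_eats_pipeline", start_date=datetime(2023, 1, 1), schedule=None, catchup=False) as dag:
    # Upload the Spark script to S3 (hypothetical paths).
    upload_script = LocalFilesystemToS3Operator(
        task_id="upload_spark_script",
        filename="dags/scripts/clean_uber_eats.py",
        dest_key="s3://uber-eats-raw/scripts/clean_uber_eats.py",
        replace=True,
    )

    # Spin up a transient EMR cluster (minimal job-flow config, default EMR roles assumed).
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides={
            "Name": "uber-eats-emr",
            "ReleaseLabel": "emr-6.9.0",
            "Instances": {
                "InstanceGroups": [
                    {"Name": "Primary", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                ],
                "KeepJobFlowAliveWhenNoSteps": True,
            },
            "JobFlowRole": "EMR_EC2_DefaultRole",
            "ServiceRole": "EMR_DefaultRole",
        },
    )

    # Submit the Spark cleaning job as an EMR step.
    add_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id=create_cluster.output,
        steps=[{
            "Name": "clean_uber_eats_data",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://uber-eats-raw/scripts/clean_uber_eats.py"],
            },
        }],
    )

    # Wait for the Spark step to finish before moving on.
    wait_for_step = EmrStepSensor(
        task_id="wait_for_spark_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step')[0] }}",
    )

    # Tear down the cluster, then crawl the cleaned data into the Glue Data Catalog.
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
    )
    crawl = GlueCrawlerOperator(
        task_id="run_glue_crawler",
        config={
            "Name": "uber-eats-crawler",
            "Role": "AWSGlueServiceRole-uber-eats",  # assumed IAM role name
            "DatabaseName": "uber_eats",             # assumed Data Catalog database
            "Targets": {"S3Targets": [{"Path": "s3://uber-eats-clean/"}]},
        },
    )

    upload_script >> create_cluster >> add_step >> wait_for_step >> terminate_cluster >> crawl
```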

Analytics

Let's answer some questions to understand our data (a sample Athena query is sketched after this list):

  1. Which menu category is the most popular in terms of the total number of menus? (Most Popular Menu) P.S.: I like sandwiches, and you? :)

  2. Which restaurants have the highest total number of menus? (Total Menu per Restaurant)

  3. Which city has the highest number of restaurants? (Restaurants by City) Ohhh! Houston has the most restaurants.

  4. Which restaurants are available in the largest number of cities? (Franchise)
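
As a sketch of how question 1 could be run programmatically, here is a minimal boto3 call to Athena. The database, table, column, and output-bucket names are assumptions based on the questions above, not the project's actual queries:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical query: count menus per category (table/column names assumed).
query = """
SELECT category, COUNT(*) AS total_menus
FROM menus
GROUP BY category
ORDER BY total_menus DESC
LIMIT 10;
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "uber_eats"},  # Glue Data Catalog database (assumed name)
    ResultConfiguration={"OutputLocation": "s3://uber-eats-athena-results/"},  # results bucket (assumed)
)
print("Query execution id:", response["QueryExecutionId"])
```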