Azure Databricks Formula1

Data Engineering Project on Formula1 Racing – Using Azure Data Factory and Databricks.

Concept of the Project 💡

  • This project acquires Formula1 datasets from the Ergast API and processes them through three layers: Bronze -> Silver -> Gold. The transformations are executed in Databricks, and the output of each layer is loaded into Delta Lake so that the Analytics team can draw meaningful, practical insights from the data. The primary objective is to build a thorough understanding of how Databricks works.
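
As an illustration of the acquisition step, the sketch below pulls one season of race results from the Ergast API. The season, request limit, and Bronze landing path are assumptions made for this example, not the project's exact values.

```python
import json
import requests

# Season, limit, and landing path are illustrative assumptions
season = 2021
url = f"http://ergast.com/api/f1/{season}/results.json"

response = requests.get(url, params={"limit": 1000, "offset": 0}, timeout=30)
response.raise_for_status()

# Ergast wraps race results in MRData -> RaceTable -> Races
races = response.json()["MRData"]["RaceTable"]["Races"]

# Land the payload unchanged in the Bronze (raw) layer
with open(f"/dbfs/mnt/formula1/bronze/{season}_results.json", "w") as f:
    json.dump(races, f)
```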

Task 🎯

  • The mission of this project is to transform the Bronze data (raw data in various formats) into Silver data (ingested data) in a columnar format (Parquet), and then into Gold data (presentation data), using PySpark in Databricks.
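
A minimal PySpark sketch of that Bronze -> Silver -> Gold flow is shown below. The storage paths, column names, and aggregation are illustrative assumptions, and `spark` is the session Databricks provides in a notebook.

```python
from pyspark.sql import functions as F

# Illustrative ADLS Gen2 paths -- the real container/account names belong to the project
bronze_path = "abfss://bronze@formula1dl.dfs.core.windows.net/results.json"
silver_path = "abfss://silver@formula1dl.dfs.core.windows.net/results"
gold_path   = "abfss://gold@formula1dl.dfs.core.windows.net/driver_points"

# Bronze -> Silver: read the raw JSON and persist it as columnar Parquet
raw_df = spark.read.json(bronze_path)
silver_df = (raw_df
             .withColumnRenamed("driverId", "driver_id")
             .withColumn("ingestion_date", F.current_timestamp()))
silver_df.write.mode("overwrite").parquet(silver_path)

# Silver -> Gold: aggregate into a presentation table and store it in Delta Lake
gold_df = (silver_df
           .groupBy("driver_id")
           .agg(F.sum("points").alias("total_points")))
gold_df.write.format("delta").mode("overwrite").save(gold_path)
```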

Source Data: 📤

  • Ergast API (Formula1 datasets)

Destination: 📥📍

  • Azure Data Lake Gen2 Storage
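
Databricks reads and writes that storage account over abfss. One common setup, sketched below with placeholder storage-account, secret-scope, and secret names, authenticates with a service principal whose credentials live in an Azure Key Vault-backed secret scope.

```python
# Placeholder names -- the real storage account, scope, and secret keys are project-specific
storage_account = "formula1dl"
scope = "formula1-scope"

client_id     = dbutils.secrets.get(scope=scope, key="databricks-app-client-id")
tenant_id     = dbutils.secrets.get(scope=scope, key="databricks-app-tenant-id")
client_secret = dbutils.secrets.get(scope=scope, key="databricks-app-client-secret")

# Standard OAuth (service principal) settings for ADLS Gen2 access via abfss://
spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
```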

Tools ⚙

  • Jobs Run: ADF Pipeline
  • Transformation: Databricks

Approach

Environment Setup

  • Azure Subscription
  • Data Factory
  • Data Lake Storage Gen2
  • Azure Key Vault
  • Azure Databricks Cluster

Architecture Overview

(Architecture diagram)

Pipeline Steps:

  1. Create a Linked Service to Azure Databricks
  2. Create a Linked Service to Azure Data Lake Storage (Gen2)
  3. Create the 1st pipeline:
     • Check that the metadata exists before executing the ingestion notebooks, using an If Condition activity
  4. Create the 2nd pipeline:
     • Execute trans/1.race_results.ipynb first, then link trans/2.driver_standings.ipynb and trans/3.constructor_standings.ipynb on success (see the sketch after this list)
  5. Create the 3rd pipeline:
     • Make the 2nd pipeline's execution depend on the 1st
     • Finally, execute the notebooks
  6. Create a Tumbling Window trigger
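
In the project this ordering is enforced by ADF activity dependencies. Purely as an illustration of the same dependency chain, a Databricks driver notebook could run the transformation notebooks in order with dbutils.notebook.run; workspace paths typically omit the .ipynb extension, and the timeout value here is an assumption.

```python
# Race results must finish first; both standings notebooks read its output
dbutils.notebook.run("trans/1.race_results", 3600)

# These two run only after the first call returns successfully,
# mirroring the on-success links in the 2nd ADF pipeline
dbutils.notebook.run("trans/2.driver_standings", 3600)
dbutils.notebook.run("trans/3.constructor_standings", 3600)
```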


Used Technologies

  • Azure Data Factory
  • Azure Databricks (PySpark)
  • Azure Storage Account
  • Azure Data Lake Gen2
  • Azure Key Vault