In this project, we used Azure cloud services to design and orchestrate a data pipeline that performs data engineering operations (Ingestion, Transformation, Analysis, Load) on a Formula 1 racing dataset.
The data for all Formula 1 races from 1950 onwards is obtained from an open-source API called the Ergast Developer API. The structure of the database is shown in the following ER diagram and explained in the Database User Guide.
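As a quick illustration of the raw source, the minimal sketch below pulls one season's race schedule from the public Ergast API. The JSON nesting (`MRData` → `RaceTable` → `Races`) follows the API's documented response format; which fields to print is just our choice here, and paging and error handling are omitted:

```python
import requests

# Minimal sketch: fetch the 2008 race schedule from the Ergast API.
# Paging (limit/offset) and retries are omitted for brevity.
response = requests.get("http://ergast.com/api/f1/2008.json", timeout=30)
response.raise_for_status()

races = response.json()["MRData"]["RaceTable"]["Races"]
for race in races:
    print(race["round"], race["raceName"], race["date"])
```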
- Python
- PySpark
- Azure Databricks
- Azure Data Factory (ADF)
- Azure Data Lake Storage Gen2 (ADLS)
- Delta Lake
The solution used in this project is based on the "Modern analytics architecture with Azure Databricks" from the Azure Architecture Center:
- data - contains sample raw data from the Ergast API.
- set-up - notebooks to mount the ADLS storage containers (raw, ingested, presentation) in Databricks; see the mount sketch after this list.
- raw - contains a SQL file to create the ingested tables using Spark SQL.
- ingestion - contains notebooks to ingest all the data files from the raw layer into the ingested layer. Handles incremental data for the results, pitstops, laptimes, and qualifying files; see the merge sketch after this list.
- trans - contains notebooks to transform the data from the ingested layer to the presentation layer. The notebooks perform the transformations needed to set the data up for analysis.
- analysis - contains SQL files for finding the dominant drivers and teams and for preparing the results for visualization; an illustrative query follows this list.
- includes - contains notebooks with helper functions used in the transformations.
- utils - contains a SQL file to drop all databases, used when re-running the incremental load from scratch.
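For reference, here is a minimal sketch of the mount pattern the set-up notebooks use, assuming an Azure service principal whose credentials live in a Databricks secret scope. The secret scope, key names, and storage account below are placeholders, not the project's actual values (`dbutils` is provided by the Databricks notebook runtime):

```python
# Minimal Databricks sketch: mount an ADLS Gen2 container via a service
# principal. Secret scope, key names, and storage account are placeholders.
client_id     = dbutils.secrets.get(scope="formula1-scope", key="client-id")
tenant_id     = dbutils.secrets.get(scope="formula1-scope", key="tenant-id")
client_secret = dbutils.secrets.get(scope="formula1-scope", key="client-secret")

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": client_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Repeated once per layer: raw, ingested, presentation.
dbutils.fs.mount(
    source="abfss://raw@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/formula1/raw",
    extra_configs=configs,
)
```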
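Because the incremental files (results, pitstops, laptimes, qualifying) arrive race by race, the ingestion notebooks upsert into the ingested layer rather than overwriting it. Below is a hedged sketch of that pattern using a Delta Lake merge; the table path, source file path, and merge keys (`race_id`, `driver_id`) are illustrative assumptions, not the project's exact code:

```python
from delta.tables import DeltaTable

# Illustrative incremental upsert into the ingested layer.
# Paths and merge keys are assumed; `spark` is the notebook session.
results_df = spark.read.json("/mnt/formula1/raw/<cutover-date>/results.json")

target = DeltaTable.forPath(spark, "/mnt/formula1/ingested/results")

(
    target.alias("tgt")
    .merge(
        results_df.alias("src"),
        "tgt.race_id = src.race_id AND tgt.driver_id = src.driver_id",
    )
    .whenMatchedUpdateAll()     # refresh rows that were re-delivered
    .whenNotMatchedInsertAll()  # append rows from newly raced events
    .execute()
)
```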
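And as an example of the kind of query the analysis SQL files run, the sketch below ranks dominant drivers by average points per race over the presentation layer. The table and column names (`f1_presentation.race_results`, `driver_name`, `points`) are assumptions for illustration only:

```python
# Illustrative Spark SQL over the presentation layer; table and column
# names are assumed, not taken from the project's actual schema.
dominant_drivers = spark.sql("""
    SELECT driver_name,
           COUNT(*)    AS total_races,
           SUM(points) AS total_points,
           AVG(points) AS avg_points
    FROM   f1_presentation.race_results
    GROUP  BY driver_name
    HAVING COUNT(*) >= 50
    ORDER  BY avg_points DESC
""")
dominant_drivers.show(10)
```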