- This project involves the acquisition of Formula 1 datasets from the Ergast API. These datasets are then transformed in three layers, i.e., Bronze -> Silver -> Gold, with the transformations executed in Databricks. The output of each transformation is loaded into Delta Lake so that the Analytics team can draw meaningful, practical insights from the data. The primary objective is to gain a comprehensive understanding of how Databricks works.
- The mission of this project is to transform the Bronze data (i.e., raw data) of different formats into Silver data (i.e., ingested data) in a columnar format (i.e., Parquet), and then into Gold data (i.e., presentation data), using PySpark in Databricks; a sketch of one such hop follows below.
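As a rough illustration of the layer-to-layer flow, here is a minimal PySpark sketch of one Bronze -> Silver -> Gold hop. The mount point `/mnt/formula1dl`, the `circuits` dataset, and its column names are illustrative assumptions, not the project's actual notebook code:

```python
# Minimal sketch of one Bronze -> Silver -> Gold hop in a Databricks notebook.
# Paths, dataset, and column names are assumed for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` in Databricks

# Bronze -> Silver: read the raw CSV, standardise columns, persist as Parquet
bronze_df = spark.read.option("header", True).csv("/mnt/formula1dl/bronze/circuits.csv")
silver_df = (
    bronze_df
    .withColumnRenamed("circuitId", "circuit_id")
    .withColumn("ingestion_date", current_timestamp())
)
silver_df.write.mode("overwrite").parquet("/mnt/formula1dl/silver/circuits")

# Silver -> Gold: shape the ingested data for analytics and load it into Delta Lake
gold_df = (
    spark.read.parquet("/mnt/formula1dl/silver/circuits")
    .select("circuit_id", "name", "location", "country")
)
gold_df.write.mode("overwrite").format("delta").save("/mnt/formula1dl/gold/circuits")
```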
- Ergast (https://ergast.com/mrd/)
- I have manually ingested these datasets, in their different formats, into Data Lake Gen2 (see Datasets).
- Azure Data Lake Gen2 Storage
- ADF Pipeline
- Databricks
- Azure Subscription
- Data Factory
- Data Lake Storage Gen2
- Azure Key Vault
- Azure Databricks Cluster (the sketch after this list shows how these pieces fit together from a notebook)
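Once these resources exist, a Databricks notebook typically reaches the storage account through a Key Vault-backed secret scope rather than a hard-coded key. A minimal sketch, assuming a secret scope named `formula1-scope`, a secret named `formula1dl-account-key`, and a storage account named `formula1dl` (all placeholders):

```python
# Sketch: access ADLS Gen2 from a Databricks notebook via a Key Vault-backed
# secret scope. Scope, secret, and account names below are assumptions.
# `dbutils` and `spark` are provided by the Databricks notebook environment.
storage_account = "formula1dl"
account_key = dbutils.secrets.get(scope="formula1-scope", key="formula1dl-account-key")

# Register the account key with the Spark session for the abfss:// filesystem
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key,
)

# Quick sanity check: list the Bronze (raw) container
for f in dbutils.fs.ls(f"abfss://bronze@{storage_account}.dfs.core.windows.net/"):
    print(f.path)
```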
- Create a Linked Service to Azure Databricks
- Create a Linked Service to Azure Data Lake Storage (Gen2); an SDK sketch of both follows below
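In this project both linked services are created through the ADF UI; purely for illustration, the sketch below shows roughly equivalent calls with the azure-mgmt-datafactory Python SDK. Every name (subscription, resource group, factory, workspace URL, cluster id, secrets) is a placeholder:

```python
# Sketch only: create the two linked services programmatically.
# All identifiers are placeholders; the project itself uses the ADF UI.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobFSLinkedService,
    AzureDatabricksLinkedService,
    LinkedServiceResource,
    SecureString,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, adf = "<resource-group>", "<data-factory-name>"

# Linked service to the Azure Databricks workspace (existing interactive cluster)
client.linked_services.create_or_update(
    rg, adf, "ls_azure_databricks",
    LinkedServiceResource(properties=AzureDatabricksLinkedService(
        domain="https://<workspace>.azuredatabricks.net",
        access_token=SecureString(value="<databricks-access-token>"),
        existing_cluster_id="<cluster-id>",
    )),
)

# Linked service to Azure Data Lake Storage Gen2
client.linked_services.create_or_update(
    rg, adf, "ls_adls_gen2",
    LinkedServiceResource(properties=AzureBlobFSLinkedService(
        url="https://<storage-account>.dfs.core.windows.net",
        account_key="<storage-account-key>",
    )),
)
```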
- Create 1st Pipeline:
- Use a Get Metadata activity to check that the raw data exists, and execute the ingestion notebooks inside an If Condition activity only when it does (sketched below)
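A hedged sketch of this pipeline with the same SDK: a Get Metadata activity reports whether the raw data exists, and an If Condition activity runs the ingestion notebook only when it does. The dataset and notebook names are placeholders:

```python
# Sketch: Get Metadata -> If Condition -> Databricks notebook (placeholder names).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    DatabricksNotebookActivity,
    DatasetReference,
    Expression,
    GetMetadataActivity,
    IfConditionActivity,
    LinkedServiceReference,
    PipelineResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, adf = "<resource-group>", "<data-factory-name>"

# Report whether the source folder/file exists
check = GetMetadataActivity(
    name="CheckRawDataExists",
    dataset=DatasetReference(reference_name="ds_raw_formula1"),
    field_list=["exists"],
)

# Ingestion notebook, run only inside the true branch below
ingest = DatabricksNotebookActivity(
    name="IngestCircuits",
    notebook_path="/ingestion/1.ingest_circuits",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="ls_azure_databricks"),
)

# Gate the notebook on the Get Metadata result
gate = IfConditionActivity(
    name="IfRawDataExists",
    expression=Expression(value="@activity('CheckRawDataExists').output.exists"),
    if_true_activities=[ingest],
    depends_on=[ActivityDependency(activity="CheckRawDataExists",
                                   dependency_conditions=["Succeeded"])],
)

client.pipelines.create_or_update(rg, adf, "pl_ingest_formula1",
                                  PipelineResource(activities=[check, gate]))
```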
- Create 2nd Pipeline:
- Execute trans/1.race_results.ipynb first, then run trans/2.driver_standings.ipynb and trans/3.constructor_standings.ipynb on its success (sketched below)
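The fan-out on success can be expressed with activity dependencies; a sketch under the same placeholder setup (pipeline and activity names assumed):

```python
# Sketch: run 1.race_results first; both standings notebooks depend on its success.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    DatabricksNotebookActivity,
    LinkedServiceReference,
    PipelineResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, adf = "<resource-group>", "<data-factory-name>"

dbx = LinkedServiceReference(type="LinkedServiceReference",
                             reference_name="ls_azure_databricks")

race_results = DatabricksNotebookActivity(
    name="RaceResults", notebook_path="/trans/1.race_results",
    linked_service_name=dbx)

# Both downstream notebooks wait for RaceResults to succeed, then run in parallel
driver_standings = DatabricksNotebookActivity(
    name="DriverStandings", notebook_path="/trans/2.driver_standings",
    linked_service_name=dbx,
    depends_on=[ActivityDependency(activity="RaceResults",
                                   dependency_conditions=["Succeeded"])])

constructor_standings = DatabricksNotebookActivity(
    name="ConstructorStandings", notebook_path="/trans/3.constructor_standings",
    linked_service_name=dbx,
    depends_on=[ActivityDependency(activity="RaceResults",
                                   dependency_conditions=["Succeeded"])])

client.pipelines.create_or_update(
    rg, adf, "pl_transform_formula1",
    PipelineResource(activities=[race_results, driver_standings,
                                 constructor_standings]))
```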
- Create 3rd Pipeline:
- Make the 2nd pipeline's execution depend on the success of the 1st pipeline
- Finally, execute the notebooks end to end (see the sketch below)
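The dependent execution can be modelled with two Execute Pipeline activities chained on success; a sketch with the same placeholder names:

```python
# Sketch: pipeline 3 chains pipeline 1 (ingest) into pipeline 2 (transform).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ActivityDependency,
    ExecutePipelineActivity,
    PipelineReference,
    PipelineResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, adf = "<resource-group>", "<data-factory-name>"

run_ingest = ExecutePipelineActivity(
    name="RunIngestPipeline",
    pipeline=PipelineReference(reference_name="pl_ingest_formula1"),
    wait_on_completion=True)

run_transform = ExecutePipelineActivity(
    name="RunTransformPipeline",
    pipeline=PipelineReference(reference_name="pl_transform_formula1"),
    wait_on_completion=True,
    depends_on=[ActivityDependency(activity="RunIngestPipeline",
                                   dependency_conditions=["Succeeded"])])

client.pipelines.create_or_update(
    rg, adf, "pl_process_formula1",
    PipelineResource(activities=[run_ingest, run_transform]))
```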
- Create a tumbling window trigger scoped to the 3rd pipeline (sketched below)
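A tumbling window trigger fires the end-to-end pipeline over fixed, non-overlapping windows; a sketch, assuming a daily (24-hour) window and the placeholder pipeline name above:

```python
# Sketch: daily tumbling window trigger for the end-to-end pipeline.
from datetime import datetime

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    TriggerPipelineReference,
    TriggerResource,
    TumblingWindowTrigger,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
rg, adf = "<resource-group>", "<data-factory-name>"

trigger = TumblingWindowTrigger(
    pipeline=TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="pl_process_formula1")),
    frequency="Hour",
    interval=24,                      # one non-overlapping 24-hour window per day
    start_time=datetime(2021, 3, 1),  # windows are generated from this point on
    max_concurrency=1,
)

client.triggers.create_or_update(rg, adf, "tr_process_formula1",
                                 TriggerResource(properties=trigger))
# Triggers are created stopped; start explicitly (begin_start on current SDK versions)
client.triggers.begin_start(rg, adf, "tr_process_formula1").result()
```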
- Azure Data Factory
- Azure Databricks (PySpark)
- Azure Storage Account
- Azure Data Lake Gen2
- Azure Key Vault