The primary purpose of this ETL pipeline is to reinforce the knowledge acquired through self-study. Drawing on the knowledge and skills developed, the project aims to apply the real-world scenario of building an ETL pipeline.
The scope of the project involves creating an ETL (Extract, Transform, Load) pipeline. This pipeline will extract data from a SQL Server through a control table and then perform transformations using a medallion architecture. The objective is to design the pipeline in such a way that it can handle both full load data and incremental load data. This means that every 24 hours, when the pipeline is executed, it will only load and transform the newly updated data. This approach will not only save costs but also enhance performance.
- Azure Data Factory
- Azure Databricks
- Azure Key Vault
- SQL Servers
- Azure Data Lake Gen 2
There is one control table in sql server, that basically contains all the information about all other table, like Source_Object_Name, Target_Location, Load_Type, Creation_Time, Indicator, ...
- To get all the tables information from control table whose indicator is 1.
- To iterate over each tables inside the control table.
- Inside ForEach, check whether it is full load or incremental load.
- Use LookUp activity to get the max watermark value from the table itself.
- Use Copy activity to copy only the incremented data to raw container in ADLS Gen 2, which is greater then the mentioned watermark value in control table.
- After that use, Stored Procedure activity, to update the latest watermark value in control table.
- Use Copy activity to copy all the data to raw container in ADLS Gen 2.
- After that use, Stored Procedure activity, to update the Indicator value to 1 in control table. So, that in next run, it will not again fetch that data.
- Check Null Values.
- Validate data type and Handling null values.
- Check Duplicates data.
- Categorizing supplier negotiation score and defect quality.
- Prioritizing transportation modes.
- Categorizing product prices.
- Calculating total costs for purchase orders.
- Ranking of suppliers based on the total number of services provided.
- Recommending supplier based on their supplying data.