Create high-quality data pipelines that are dynamic, automated, and monitored for efficient operation.
The City of New York would like to develop a Data Analytics platform on Azure Synapse Analytics to accomplish two primary objectives:
- Analyze how the City's financial resources are allocated and how much of the City's budget is being devoted to overtime.
- Make the data available to the interested public to show how the City’s budget is being spent on salary and overtime pay for all municipal employees.
As a Data Engineer, I have to create high-quality data pipelines that are dynamic, automated, and monitored for efficient operation. The project team also includes the city’s quality assurance experts, who will test the pipelines to find errors and improve overall data quality.
The source data resides in Azure Data Lake and needs to be processed in a NYC data warehouse in Azure Synapse Analytics. The source datasets consist of CSV files with Employee master data and monthly payroll data entered by various City agencies. The following Azure resources are used in this project:
- Azure Data Lake Gen2
- Azure SQL DB
- Azure Data Factory
- Azure Synapse Analytics
Setup Data and Resources in Azure
Create an Azure Data Lake Storage Gen2 (storage account) with three directories in the storage container, named dirpayrollfiles, dirhistoryfiles, and dirstaging
Upload data from the project data to the dirpayrollfiles folder
Upload the nycpayroll_2020.csv file (historical data) from the project data to the dirhistoryfiles folder
Create a table called NYC_Payroll_Data in db_nycpayroll in the Azure Query Editor with SQL Script:
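A minimal sketch of that script, assuming the column names and types below (drawn from the typical NYC payroll CSV layout); adjust them to match the actual headers in the project files:

```sql
-- Illustrative schema for the payroll transaction table in db_nycpayroll (Azure SQL DB).
-- Column names and types are assumptions; match them to the CSV headers before running.
CREATE TABLE [dbo].[NYC_Payroll_Data](
    [FiscalYear] [int] NULL,
    [PayrollNumber] [int] NULL,
    [AgencyID] [varchar](10) NULL,
    [AgencyName] [varchar](50) NULL,
    [EmployeeID] [varchar](10) NULL,
    [LastName] [varchar](20) NULL,
    [FirstName] [varchar](20) NULL,
    [AgencyStartDate] [date] NULL,
    [WorkLocationBorough] [varchar](50) NULL,
    [TitleCode] [varchar](10) NULL,
    [TitleDescription] [varchar](100) NULL,
    [LeaveStatusasofJune30] [varchar](50) NULL,
    [BaseSalary] [float] NULL,
    [PayBasis] [varchar](50) NULL,
    [RegularHours] [float] NULL,
    [RegularGrossPaid] [float] NULL,
    [OTHours] [float] NULL,
    [TotalOTPaid] [float] NULL,
    [TotalOtherPay] [float] NULL
);
```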
Create a dedicated SQL pool in the Synapse Analytics workspace
In the dedicated SQL pool, create the master data tables and payroll transaction tables:
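As a sketch of what those scripts could look like, assuming the master files carry employee, title, and agency attributes (the table names, column names, and distribution choices are illustrative):

```sql
-- Illustrative dedicated SQL pool tables for the master data; adjust columns to the actual files.
CREATE TABLE [dbo].[NYC_Payroll_EMP_MD](
    [EmployeeID] [varchar](10) NULL,
    [LastName]   [varchar](20) NULL,
    [FirstName]  [varchar](20) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE [dbo].[NYC_Payroll_TITLE_MD](
    [TitleCode]        [varchar](10) NULL,
    [TitleDescription] [varchar](100) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);

CREATE TABLE [dbo].[NYC_Payroll_AGENCY_MD](
    [AgencyID]   [varchar](10) NULL,
    [AgencyName] [varchar](50) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);

-- The payroll transaction table in the pool mirrors the NYC_Payroll_Data schema shown earlier.
```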
The tables are successfully created in the SQL database.
In Azure Data Factory, create a linked service to the data lake that contains the data files
If you get a connection error, remember to add your client IP address to the SQL DB firewall settings in the Azure Portal.
All three linked services are successfully created:
- Select DelimitedText
- Set the path to the nycpayroll_2021.csv in the Data Lake
- Preview the data to make sure it is correctly parsed
1. In Azure Data Factory, create the data flow to load the 2021 Payroll Data into the SQL DB transaction table (in the future, NYC will load all the transaction data into this table).
- Create a new pipeline
- Take a screenshot of the Azure Data Factory pipeline run after it has finished
- Make sure the data is successfully loaded into the SQL DB table
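For example, a quick check in the Azure Query Editor can confirm the load (the expected row count depends on the 2021 source file):

```sql
-- Sanity checks after the pipeline run; exact counts depend on the source CSV.
SELECT COUNT(*) AS LoadedRows
FROM [dbo].[NYC_Payroll_Data];

SELECT TOP (10) *
FROM [dbo].[NYC_Payroll_Data]
ORDER BY [FiscalYear] DESC;
```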
3. Create data flows to load the data from the data lake files into the Synapse Analytics data tables
- Create the data flows for loading Employee, Title, and Agency files into corresponding SQL pool tables on Synapse Analytics
- For each Employee, Title, and Agency file data flow, sink the data into the corresponding target Synapse table
5. Create pipelines containing the data flows to load the Employee, Title, Agency, and year 2021 Payroll transaction data into Synapse Analytics.
- Select the dirstaging folder in the data lake storage for staging
- Optionally you can also create one master pipeline to invoke all the Data Flows
- Validate and publish the pipelines
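After the pipelines have run, a quick row-count query against the dedicated SQL pool can confirm that each table was populated; the table names below follow the illustrative definitions sketched earlier:

```sql
-- Verify the Synapse pool master tables were populated; table names follow the earlier sketch.
SELECT 'NYC_Payroll_EMP_MD' AS TableName, COUNT(*) AS RowCnt FROM [dbo].[NYC_Payroll_EMP_MD]
UNION ALL
SELECT 'NYC_Payroll_TITLE_MD', COUNT(*) FROM [dbo].[NYC_Payroll_TITLE_MD]
UNION ALL
SELECT 'NYC_Payroll_AGENCY_MD', COUNT(*) FROM [dbo].[NYC_Payroll_AGENCY_MD];
```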
In this step, we'll extract the 2021 data and the historical data, merge and aggregate them, and store the result in Synapse Analytics. The data will be grouped by Agency Name and Fiscal Year, with TotalPaid as the aggregated measure (a SQL sketch of this aggregation follows the list below).
- Create a data flow level parameter for Fiscal Year
- Add the first Source for the table_sqldb_nyc_payroll_data table
- Add the second Source for the Azure Data Lake history folder
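The merge-and-aggregate logic of this data flow corresponds roughly to the SQL below. The TotalPaid formula (regular gross pay plus overtime plus other pay), the column names, and the source table name are assumptions based on the typical NYC payroll schema, and @FiscalYear stands in for the data flow level parameter:

```sql
-- Rough SQL equivalent of the aggregate data flow; names and formula are illustrative.
DECLARE @FiscalYear int = 2020;  -- stands in for the data flow level parameter

SELECT
    [AgencyName],
    [FiscalYear],
    SUM([RegularGrossPaid] + [TotalOTPaid] + [TotalOtherPay]) AS [TotalPaid]
FROM [dbo].[NYC_Payroll_Data]
WHERE [FiscalYear] >= @FiscalYear
GROUP BY [AgencyName], [FiscalYear];
```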
![image](https://user-images.githubusercontent.com/61830624/199210311-893b95af-3382-429b-b44f-ea565676f25d.png)