/azureDataPipelinesNYCPayroll

Data pipelines with Azure Data Factory using Azure datalake Gen 2, Azure SQL database and Azure Synapse

Primary LanguageTSQL

Data Integration Pipelines for NYC Payroll Data Analytics

The City of New York would like to develop a Data Analytics platform on Azure Synapse Analytics to accomplish two primary objectives:

  1. Analyze how the City's financial resources are allocated and how much of the City's budget is being devoted to overtime.
  2. Make the data available to the interested public to show how the City’s budget is being spent on salary and overtime pay for all municipal employees.

This project will involve creating high-quality data pipelines that are dynamic, can be automated, and monitored for efficient operation. The project will aim to develop pipelines that maintain overall data quality.

The source data resides in Azure Data Lake and needs to be processed in a NYC data warehouse in Azure Synapse Analytics. The source datasets consist of CSV files with Employee master data and monthly payroll data entered by various City agencies.

data schema

Summary of Process

  1. Infrastructure was setup: Azure Datalake Gen2, Azure Data Factory, Azure SQL Database and Azure Synapse with a dedicated SQL Server pool.

deployed resources

synapse pool

  1. Data was ingested into Azure Datalake Gen2

payroll files

history file

  1. Linked services were created in Azure Data Factory to the data sources and destinations

linked services

  1. Tables were created in Azure SQL database and Azure synapse. Sql scripts are stored in the sql_scripts folder

  2. Data sets were created in Azure Data Factory

data sets

  1. Data flows were set up in Azure Data Factory

data flows

aggregated data flow

  1. Pipelines were set up.

pipelines

  1. Having set up all pipelines, pipelines were triggered and monitored. Here are the results:

load all data

aggregated data pipeline run

Challenges

  1. Data type errors were resolved by manually setting data types in the Mapping tab of the sink task