
Current Situation End-to-End Data Pipeline

Background

Current Situation is a company that analyzes job-market data, a market reshaped by recent advances in Artificial Intelligence. To power its job placement and talent-matching pool, the company needs a solution that integrates with its data stored in an on-premises Microsoft SQL Server database.

Solution

Architecture

Technologies Used

  • Spark SQL - data transformations
  • Databricks - data transformation and warehousing environment
  • Azure Blob Storage - raw data storage
  • Azure Data Factory - ETL orchestration and data movement
  • Microsoft SQL Server - operational database for transactional systems
  • Power BI - reporting and dashboards

Skills Developed

  • Data Integration - Azure Data Factory pipeline development.
  • Big Data Processing - Spark SQL for data transformations.
  • ETL/ELT Processes - Understanding and implementing Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) workflows.
  • Data Visualization - Power BI report creation and dashboard design.
  • SQL - T-SQL (for MS SQL Server) and Spark SQL.

Use Cases and Justifications

1. MS SQL Server - Operational database for transactional systems

Justification

  • Robust relational database system for OLTP workloads
  • Strong integration with Microsoft ecosystem
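
A minimal sketch of querying the operational database from Python, assuming pyodbc and the ODBC Driver 18 for SQL Server; the server, database, table, and column names here are placeholders rather than the project's actual schema:

```python
import pyodbc

# Placeholder connection details for the on-premises server
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=onprem-sql01;DATABASE=JobMarket;"
    "UID=etl_user;PWD=<password>;TrustServerCertificate=yes;"
)
cursor = conn.cursor()

# Pull recent job postings, the kind of transactional data the pipeline extracts
cursor.execute(
    "SELECT job_id, title, posted_date FROM dbo.JobPostings WHERE posted_date >= ?",
    "2024-01-01",
)
for row in cursor.fetchmany(5):
    print(row.job_id, row.title, row.posted_date)
conn.close()
```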

2. Azure Data Factory - Orchestration and data movement

Justification

  • Managed ETL service in Azure
  • Supports various data sources and destinations
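
A run of an existing copy pipeline can also be triggered programmatically. A sketch assuming the azure-mgmt-datafactory SDK, with placeholder subscription, resource group, factory, and pipeline names:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Kick off the pipeline that copies SQL Server tables into Blob Storage
run = adf_client.pipelines.create_run(
    resource_group_name="rg-current-situation",
    factory_name="adf-current-situation",
    pipeline_name="CopySqlToBlob",
)
print("Started pipeline run:", run.run_id)
```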

3. Azure Blob Storage - Data lake for raw and processed data

Justification

  • Cost-effective storage for large volumes of unstructured data
  • Integrates well with other Azure services

4. Databricks - Big data processing and advanced analytics

Justification

  • Managed Spark environment
  • Collaborative notebook interface
  • Supports machine learning workflows
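
A sketch of reading a raw extract from Blob Storage inside a Databricks notebook, where `spark` and `dbutils` are provided by the runtime; the storage account, secret scope, and path are placeholders:

```python
storage_account = "currentsituationsa"  # placeholder account name
container = "raw"

# Authenticate to Blob Storage; the key is read from a Databricks secret scope
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    dbutils.secrets.get(scope="etl", key="storage-account-key"),
)

# Load the raw CSV extract landed by Azure Data Factory
jobs_raw = spark.read.csv(
    f"wasbs://{container}@{storage_account}.blob.core.windows.net/job_postings/",
    header=True,
    inferSchema=True,
)
jobs_raw.printSchema()
```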

5. Spark SQL - Data transformation and analysis

Justification

  • SQL interface for Spark, familiar to SQL developers
  • Distributed processing for large-scale data
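
Continuing the sketch above, an illustrative Spark SQL transformation; the column names are assumed for the example, and the `curated` schema would need to exist before the final write:

```python
# Expose the raw DataFrame to Spark SQL
jobs_raw.createOrReplaceTempView("jobs_raw")

# Aggregate postings per location for downstream reporting
jobs_by_location = spark.sql("""
    SELECT location,
           COUNT(*)         AS postings,
           MAX(posted_date) AS latest_posting
    FROM jobs_raw
    GROUP BY location
    ORDER BY postings DESC
""")

# Persist as a Delta table that Power BI can query
jobs_by_location.write.format("delta").mode("overwrite").saveAsTable("curated.jobs_by_location")
```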

6. Power BI - User-friendly interface for creating reports and dashboards

Justification

  • Strong integration with Azure and Microsoft products
  • Supports both self-service and enterprise BI

Implementation Steps

  1. Creating an Azure storage account.
  2. Creating a container for the raw data.
  3. Loading the data from MS SQL Server into Azure Blob Storage using Azure Data Factory.
  4. Transforming the data in Databricks.
  5. Visualizing the results in Power BI.

Read the full implementation steps here
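
As a rough illustration of steps 2 and 3, the container can also be created and a file landed with the azure-storage-blob SDK; the connection string, container name, and paths are placeholders, and in the actual pipeline Azure Data Factory performs the SQL Server to Blob Storage copy:

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")

# Create the container that holds raw extracts, if it does not exist yet
container = service.get_container_client("raw")
if not container.exists():
    container.create_container()

# Upload a sample extract to the path where ADF would normally land the data
with open("job_postings.csv", "rb") as data:
    container.upload_blob(name="job_postings/job_postings.csv", data=data, overwrite=True)
```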

Contributing

Contributions to improve the project are welcome!