AS
a Data Engineer, I WANTED TO
build a SERVERLESS infrastructure in Azure that simulates the behavior of a real-world big data system, SO THAT
other team members (Data Scientists or Data Analysts) can access the processed data.
- MS Azure
- Terraform
- Dash Plotly (to do)
- Python
- SQL
Terraform is used to set up the Azure infrastructure and resources:
- Data Lake
- Azure Functions
- Azure Data Factory
- Event Hub
- Logic App
- Databricks
- Cosmos DB
- API Management
- Azure Active Directory
- App Registration
Dummy data is used for this demo, since the goal is to simulate the behavior of a real-world data source, e.g. IoT data or web scraping.
Data Generator (local): a Python script that generates a dummy JSON object (a used car sales record) every 10 seconds and saves it to the bronze data folder.
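A minimal sketch of what the local generator could look like. The field names (car_id, brand, price, date_listed, date_removed) and the local `bronze` folder path are illustrative assumptions, not the actual script.

```python
import json
import random
import time
import uuid
from datetime import datetime, timedelta
from pathlib import Path

# Assumption for this sketch: records land in a local "bronze" folder.
BRONZE_DIR = Path("bronze")
BRONZE_DIR.mkdir(exist_ok=True)

def make_record() -> dict:
    """Build one dummy used-car sales record (illustrative fields only)."""
    listed = datetime.utcnow() - timedelta(days=random.randint(1, 60))
    removed = listed + timedelta(days=random.randint(1, 30))
    return {
        "car_id": str(uuid.uuid4()),
        "brand": random.choice(["Toyota", "Ford", "BMW", "Tesla"]),
        "price": round(random.uniform(5_000, 60_000), 2),
        "date_listed": listed.strftime("%Y-%m-%d"),
        "date_removed": removed.strftime("%Y-%m-%d"),
    }

if __name__ == "__main__":
    while True:
        record = make_record()
        out_file = BRONZE_DIR / f"sale_{record['car_id']}.json"
        out_file.write_text(json.dumps(record))
        print(f"Wrote {out_file}")
        time.sleep(10)  # one dummy JSON object every 10 seconds
```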
Data Generator (Azure): the same kind of Python script generates dummy data, but it is orchestrated by an Azure Logic App. It is a batch processing solution, and the data is generated at intervals defined by the Logic App's recurrence trigger.
- Install the required Python libraries
pip install azure-eventhub
pip install azure-identity
pip install aiohttp
- Authentication
2.1. AAD (Azure Active Directory), the approach used in real-world applications
Retrieve the resource ID of the Event Hubs namespace:
az eventhubs namespace show -g '<your-event-hub-resource-group>' -n '<your-event-hub-name>' --query id
Assign the role to your user:
az role assignment create --assignee "<user@domain>" \
--role "Azure Event Hubs Data Owner" \
--scope "<your-resource-id>"
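With the role assigned, the generator can authenticate through AAD with DefaultAzureCredential instead of a connection string. A minimal sketch, assuming the namespace, hub name, and sample record below are placeholders:

```python
import asyncio
import json

from azure.eventhub import EventData
from azure.eventhub.aio import EventHubProducerClient
from azure.identity.aio import DefaultAzureCredential

# Placeholders: replace with your Event Hubs namespace and hub name.
FULLY_QUALIFIED_NAMESPACE = "<your-event-hub-namespace>.servicebus.windows.net"
EVENT_HUB_NAME = "<your-event-hub-name>"

async def send_record(record: dict) -> None:
    # DefaultAzureCredential picks up the AAD identity (e.g. `az login` locally).
    credential = DefaultAzureCredential()
    producer = EventHubProducerClient(
        fully_qualified_namespace=FULLY_QUALIFIED_NAMESPACE,
        eventhub_name=EVENT_HUB_NAME,
        credential=credential,
    )
    async with producer:
        batch = await producer.create_batch()
        batch.add(EventData(json.dumps(record)))
        await producer.send_batch(batch)
    await credential.close()

if __name__ == "__main__":
    asyncio.run(send_record({"car_id": "demo", "price": 12_345.0}))
```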
2.2. Connection String is an easier way to establish and authorise the connection between Event Hubs and another application.
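A sketch of the connection-string approach, this time with the synchronous client. The environment variable names EVENT_HUB_CONN_STR and EVENT_HUB_NAME are assumptions for this example:

```python
import json
import os

from azure.eventhub import EventData, EventHubProducerClient

# Assumed environment variables; set them to your own values.
CONN_STR = os.environ["EVENT_HUB_CONN_STR"]
EVENT_HUB_NAME = os.environ["EVENT_HUB_NAME"]

producer = EventHubProducerClient.from_connection_string(
    conn_str=CONN_STR,
    eventhub_name=EVENT_HUB_NAME,
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"car_id": "demo", "price": 9_999.0})))
    producer.send_batch(batch)
```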
- Enable NFS 3.0 when creating the storage account
- Access Key - found in the specific Azure Data Lake account; this is an easier way to mount Databricks to Azure Data Lake and run the Spark job that performs the `join` and transformation
- Mount the Data Lake folder to Databricks - this keeps the two in sync so that the Spark job can access the Data Lake files (a sketch of the mount is shown below)
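One possible mount pattern, run once in a Databricks notebook. The storage account, container, and mount point names are placeholders, and in practice the access key should come from a Secret Scope (see the last section) rather than being hard-coded:

```python
# Run in a Databricks notebook (dbutils is only available there).
storage_account = "<your-storage-account>"
container = "<your-container>"
mount_point = "/mnt/datalake"

# Placeholder: read this from a Secret Scope in practice instead of hard-coding it.
access_key = "<your-storage-account-access-key>"

# Only mount if the mount point does not already exist.
if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
        mount_point=mount_point,
        extra_configs={
            f"fs.azure.account.key.{storage_account}.dfs.core.windows.net": access_key
        },
    )

display(dbutils.fs.ls(mount_point))
```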
- Define schema: `StructType` defines the schema for the data frame, with each field specified using the `StructField` type. This keeps all columns in a fixed order and reduces the size of the data by re-defining the datatypes (see the sketch after this list).
- Join job: read and join all JSON files saved in the Data Lake.
- Transformation: [Turnover] = [Date removed] - [Date listed]
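A sketch of the Spark job described above, intended for a Databricks notebook where `spark` is predefined. The column names, the split into two JSON feeds (listings and removals), and the mounted paths are assumptions for illustration:

```python
from pyspark.sql.functions import col, datediff, to_date
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Explicit schemas keep the columns in a fixed order and avoid costly schema inference.
listings_schema = StructType([
    StructField("car_id", StringType(), True),
    StructField("brand", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("date_listed", StringType(), True),
])
removals_schema = StructType([
    StructField("car_id", StringType(), True),
    StructField("date_removed", StringType(), True),
])

# Join job: read all JSON files from the mounted raw folder and join on car_id.
listings = spark.read.schema(listings_schema).json("/mnt/datalake/raw/listings/")
removals = spark.read.schema(removals_schema).json("/mnt/datalake/raw/removals/")
joined = listings.join(removals, on="car_id", how="inner")

# Transformation: [Turnover] = [Date removed] - [Date listed], in days.
silver = joined.withColumn(
    "turnover",
    datediff(to_date(col("date_removed")), to_date(col("date_listed"))),
)

silver.write.mode("overwrite").json("/mnt/datalake/silver/used_car_sales")
```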
- raw folder: hot-tier access, designed as a landing zone for all ingested data.
- silver folder: hot-tier access, designed as a processed zone for all processed data.
- archive folder: cold-tier access, storing raw data that has already been processed by the Spark job. This helps reduce storage costs while still allowing access to the JSON objects if needed.
Because Databricks storage doesn't support wildcard search, it is not ideal to handle this file management within the Spark job, so a Logic App manages the data flow instead.
- Create an Azure Cosmos DB account
- Use ADF or Airflow to load the silver data from the Data Lake
- Check the schema
- ...
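A sketch of what the load step could look like in plain Python (e.g. inside an Airflow task), using the azure-cosmos SDK. The environment variables, database/container names, partition key, and the local "silver" folder used as the source are all assumptions; reading directly from the Data Lake and the ADF/Airflow orchestration are left out:

```python
import json
import os
from pathlib import Path

from azure.cosmos import CosmosClient, PartitionKey

# Assumed environment variables and names, for illustration only.
COSMOS_URL = os.environ["COSMOS_URL"]
COSMOS_KEY = os.environ["COSMOS_KEY"]
DATABASE_NAME = "used_car_sales"
CONTAINER_NAME = "silver"

client = CosmosClient(COSMOS_URL, credential=COSMOS_KEY)
database = client.create_database_if_not_exists(DATABASE_NAME)
container = database.create_container_if_not_exists(
    id=CONTAINER_NAME,
    partition_key=PartitionKey(path="/car_id"),
)

# Upsert each silver record; Cosmos DB requires an "id" field on every item.
for path in Path("silver").glob("*.json"):
    item = json.loads(path.read_text())
    item["id"] = item["car_id"]
    container.upsert_item(item)
```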
- Data Management lifecycle enables hot-tier and cold-tier storage.
- Create a Secret Scope: open https://mydatabricks-instance#secrets/createScope (replace mydatabricks-instance with your own instance name) to create a Secret Scope in Databricks, so that the credentials are not exposed when mounting Azure Data Lake.
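Once the scope exists, the mount code can pull the key from it instead of hard-coding it. The scope and key names below are examples, not the actual ones:

```python
# Run in a Databricks notebook. Scope/key names are illustrative.
dbutils.secrets.listScopes()  # verify the scope was created

access_key = dbutils.secrets.get(scope="demo-scope", key="datalake-access-key")
# Pass `access_key` into the dbutils.fs.mount() call shown earlier.
```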