Azure-demo

Used Car Sales Data Real-Time Architecture


Demo Overview

AS a Data Engineer, I WANT TO build a SERVERLESS infrastructure in Azure that simulates the behavior of a real-world big data system, SO THAT other team members (Data Scientists or Data Analysts) can access the processed data.

Design Flow



Agile Project Management

Tools

  • MS Azure
  • Terraform
  • Dash Plotly (to do)

Languages

  • Python
  • SQL

Azure Setup

Terraform is used to set up the following Azure infrastructure and resources:

  1. Data Lake
  2. Azure Functions
  3. Azure Data Factory
  4. Event Hub
  5. Logic App
  6. Databricks
  7. Cosmos DB
  8. API Management
  9. Azure Active Directory
  10. App Registration

Part 1. Data Source

Dummy data is used for this demo, because the goal is to simulate the behavior of a real-world data source, e.g., IoT data or web scraping.

Data Generator in Local: this Python script generates a dummy JSON object (a used car sales record) every 10 seconds and saves it to the bronze data folder.
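
A minimal sketch of such a generator, assuming illustrative field names (make, price, date_listed, date_removed) and a local bronze/ folder:

import json
import random
import time
import uuid
from datetime import datetime, timedelta
from pathlib import Path

BRONZE_DIR = Path("bronze")        # assumed local landing folder
BRONZE_DIR.mkdir(exist_ok=True)

def make_record() -> dict:
    # Build one dummy used-car-sales record (field names are illustrative).
    listed = datetime.utcnow() - timedelta(days=random.randint(1, 60))
    removed = listed + timedelta(days=random.randint(1, 30))
    return {
        "id": str(uuid.uuid4()),
        "make": random.choice(["Toyota", "Ford", "BMW", "Honda"]),
        "price": round(random.uniform(2000, 45000), 2),
        "date_listed": listed.strftime("%Y-%m-%d"),
        "date_removed": removed.strftime("%Y-%m-%d"),
    }

if __name__ == "__main__":
    while True:
        record = make_record()
        (BRONZE_DIR / f"sale_{record['id']}.json").write_text(json.dumps(record))
        time.sleep(10)             # one record every 10 seconds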

Data Generator in Azure: this Python script generates dummy data and is wired into an Azure Logic App. It is a batch processing solution, and the data is generated at intervals defined by the Logic App's recurrence trigger.

Streaming Ingested Data


Part 2. Event Hubs

  1. Load the Python libraries:
pip install azure-eventhub
pip install azure-identity
pip install aiohttp
  2. Authentication

2.1. AAD (real-world application)

Retrieve the Event Hubs namespace resource ID:

az eventhubs namespace show -g '<your-event-hub-resource-group>' -n '<your-event-hub-namespace>' --query id

Assign the role:

az role assignment create --assignee "<user@domain>" \
--role "Azure Event Hubs Data Owner" \
--scope "<your-resource-id>"

2.2. Connection String is an easier way to establish and authorise the connection between Event Hubs and another application.
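
The same producer built from a connection string (taken from the namespace's Shared access policies), sketched:

from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<your-event-hubs-connection-string>",
    eventhub_name="<your-event-hub-name>",
)

with producer:
    batch = producer.create_batch()
    batch.add(EventData('{"make": "Ford", "price": 8500}'))
    producer.send_batch(batch)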


Part 3. Data Lake

  1. Enable NFS 3.0 when creating the storage account.
  2. Access Key - found in the Data Lake storage account's access keys; this is the easier way to mount Azure Data Lake in Databricks.
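
A minimal Databricks notebook sketch of the access-key approach; the storage account, secret scope, and bronze container names are placeholders. The key is registered with the Spark session so abfss:// paths can be read directly:

# Databricks notebook cell - replace the placeholders with your own values.
storage_account = "<your-storage-account>"
access_key = dbutils.secrets.get(scope="<your-secret-scope>", key="<your-key-name>")

# Register the account key with the Spark session for ADLS Gen2 access.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    access_key,
)

# Read raw JSON files straight from the Data Lake.
df = spark.read.json(f"abfss://bronze@{storage_account}.dfs.core.windows.net/raw/")
display(df)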

Part 4. Databricks

Databricks runs the Spark job that performs the `join` and `transformation`:
  1. Define schema: StructType defines the schema for the data frame, with each field specified using the StructField type. This keeps the columns in a fixed order and reduces the data size by explicitly defining data types.
  2. Join job: read and join all JSON-format files saved in the Data Lake.
  3. Transformation: [Turnover] = [Date removed] - [Date listed]
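
A condensed PySpark sketch of these three steps; the folder names (raw/listings, raw/removals) and column names are illustrative, and the notebook in this repo is the source of truth:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, datediff
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

spark = SparkSession.builder.getOrCreate()
base = "abfss://bronze@<your-storage-account>.dfs.core.windows.net"

# 1. Define schema: keeps the columns ordered and avoids costly schema inference.
listing_schema = StructType([
    StructField("id", StringType(), False),
    StructField("make", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("date_listed", DateType(), True),
])
removal_schema = StructType([
    StructField("id", StringType(), False),
    StructField("date_removed", DateType(), True),
])

# 2. Join job: read all JSON files from the raw zone and join them on the record id.
listings = spark.read.schema(listing_schema).json(f"{base}/raw/listings/")
removals = spark.read.schema(removal_schema).json(f"{base}/raw/removals/")
joined = listings.join(removals, on="id", how="inner")

# 3. Transformation: Turnover = Date removed - Date listed (in days).
silver = joined.withColumn("turnover", datediff(col("date_removed"), col("date_listed")))

# Write the processed result to the silver zone.
silver.write.mode("overwrite").parquet(f"{base}/silver/used_car_sales/")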

Part 5. Data Flow Management

  1. raw folder as hot-tier access, the landing zone for all ingested data.
  2. silver folder as hot-tier access, the processed zone for all processed data.
  3. archive folder as cold-tier access, storing raw data that has already been processed by the Spark job. This reduces storage costs while still allowing access to the JSON objects if needed.

Because Databricks storage utilities don't support wildcard search, handling file moves inside the Spark job is not ideal. A Logic App therefore manages the data flow, moving processed raw files into the archive folder.
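
For illustration, the equivalent move that the Logic App performs (shifting processed files from raw/ to archive/), sketched in Python with azure-storage-file-datalake; the account, key, and container names are placeholders, and the cold-tier change itself would be handled separately (e.g., by lifecycle management):

from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<your-storage-account>.dfs.core.windows.net"
ACCOUNT_KEY = "<your-access-key>"
FILESYSTEM = "bronze"              # assumed container name

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=ACCOUNT_KEY)
fs = service.get_file_system_client(FILESYSTEM)

# Move every processed file from raw/ into archive/.
for path in fs.get_paths(path="raw"):
    if path.is_directory:
        continue
    file_client = fs.get_file_client(path.name)
    new_path = path.name.replace("raw/", "archive/", 1)
    file_client.rename_file(f"{FILESYSTEM}/{new_path}")   # new name must include the filesystem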




To-do

Part 6. Cosmos DB (Data Warehouse)

  • Create an Azure Cosmos DB account
  • Use ADF or Airflow to load silver data from the Data Lake
  • Check the schema
  • ...

Part 7. API Management


Reference