⚙️ Data Engineering Project ⚙️

🌪️🌪️ Hurricane Harvey Tweets and Satellite Images - Azure Data Pipelines and Data Visualization 🌪️🌪️

Introduction & Goals

Main goals:

Engineer a data streaming pipeline on Azure that ingests and processes tweets and satellite image data from the Hurricane Harvey natural disaster and serves a Power BI report.

This is a meta repository: it contains the documentation and links to two subfolders, each with a distinct purpose:

  1. hurricane-proc-send-data: Pre-processes tweets about Hurricane Harvey, combines them with satellite images of buildings with and without damage, and simulates a streaming data source with a Python program that sends requests to an Azure API endpoint (#TODO fire CLI).

  2. hurricane-streaming-az-funcs: Azure data streaming pipeline that:

    • Ingests tweets from the local source client via Azure API Management, with an Azure Function as the backend
    • Uses Azure Event Hubs as a message queue service
    • Runs an Azure Function that takes messages from Azure Event Hubs and writes them to Azure Cosmos DB
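
The simulated streaming source described in item 1 could look roughly like the sketch below. It is an illustration only, not the code in hurricane-proc-send-data: the CSV column names (`id`, `text`), the round-robin pairing of tweets with image paths, and the function names `tweet_messages` and `stream` are all assumptions.

```python
import csv
import io
import itertools
import json
import time
from typing import Callable, Iterable, Iterator, TextIO


def tweet_messages(csv_file: TextIO, image_paths: Iterable[str]) -> Iterator[str]:
    """Yield one JSON message per tweet row, pairing each tweet with an image path.

    The column names and the round-robin pairing are assumptions for illustration.
    """
    paths = list(image_paths)
    if not paths:
        raise ValueError("at least one image path is required")
    images = itertools.cycle(paths)  # reuse images when there are more tweets than images
    for row in csv.DictReader(csv_file):
        yield json.dumps({
            "tweet_id": row["id"],
            "text": row["text"],
            "image_path": next(images),
        })


def stream(messages: Iterable[str], send: Callable[[str], object],
           delay_seconds: float = 1.0) -> int:
    """Pass each message to `send`, pausing between messages to mimic a live feed."""
    count = 0
    for message in messages:
        send(message)
        count += 1
        time.sleep(delay_seconds)
    return count


if __name__ == "__main__":
    # Inline sample data stands in for the Kaggle CSV file.
    sample = io.StringIO("id,text\n1,Stay safe Houston\n2,Water rising fast\n")
    images = ["damage/a.jpeg", "no_damage/b.jpeg"]  # hypothetical file names
    stream(tweet_messages(sample, images), print, delay_seconds=0.1)
```

Here `send` is just `print`; in a real client it would be whatever function posts the message to the API endpoint. Keeping the pacing separate from the transport makes the generator easy to test without network access.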

Tools:

Data

  1. Hurricane Harvey Tweets from Kaggle.

Tweets containing Hurricane Harvey from the morning of 8/25/2017. I hope to keep this updated if computer problems do not persist.

8/30 update: This update includes the most recent tweets tagged "Tropical Storm Harvey", spanning 8/20 to 8/30, as well as a properly merged version of the dataset that includes tweets from before Harvey was downgraded back to a tropical storm.

  2. Satellite Images of Hurricane Damage from Kaggle.

Overview: The data are satellite images from Texas after Hurricane Harvey, divided into two groups (damage and no_damage). The goal is to make a model that can automatically identify whether a given region is likely to contain flooding damage.

Source: Data originally taken from https://ieee-dataport.org/open-access/detecting-damaged-buildings-post-hurricane-satellite-imagery-based-customized. It can be cited with http://dx.doi.org/10.21227/sdad-1e56, and the original paper is at https://arxiv.org/abs/1807.01688

Used Tools

Connect

  • Azure API Management

    API Management (APIM) is a way to create consistent and modern API gateways for existing back-end services.
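
As a rough illustration of how a client might call an API published through APIM, the sketch below builds a POST request with the Python standard library. The gateway URL, the `/tweets` path, the payload fields, and the helper name `build_request` are placeholders chosen for this example; `Ocp-Apim-Subscription-Key` is the default header APIM uses for subscription keys.

```python
import json
import urllib.request


def build_request(url: str, subscription_key: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request that carries an APIM subscription key."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": subscription_key,
        },
    )


if __name__ == "__main__":
    request = build_request(
        "https://example.azure-api.net/tweets",  # hypothetical gateway URL
        "<subscription-key>",                    # placeholder, not a real key
        {"tweet_id": "1", "text": "Stay safe Houston"},
    )
    print(request.get_method(), request.full_url)
    # Sending is left out on purpose. Against a real endpoint it would be:
    # with urllib.request.urlopen(request, timeout=10) as response:
    #     print(response.status)
```

Building the request separately from sending it keeps the example runnable without an Azure subscription.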

Buffer

  • Azure Event Hubs

    Azure Event Hubs is a big data streaming platform and event ingestion service. It can receive and process millions of events per second. Data sent to an event hub can be transformed and stored by using any real-time analytics provider or batching/storage adapters.

Processing

  • Azure Functions

    Azure Functions is a serverless solution that allows you to write less code, maintain less infrastructure, and save on costs. Instead of worrying about deploying and maintaining servers, the cloud infrastructure provides all the up-to-date resources needed to keep your applications running.

Storage

  • Azure Blob Storage

    Azure Blob storage is Microsoft's object storage solution for the cloud. Blob storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data.

  • Azure Cosmos DB - Core (SQL) API - Document Store

    Azure Cosmos DB is a fully managed NoSQL database for modern app development. Single-digit millisecond response times, and automatic and instant scalability, guarantee speed at any scale. Business continuity is assured with SLA-backed availability and enterprise-grade security.
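
To make the document-store idea concrete, the sketch below shapes a JSON message into a dict that could be stored as a Cosmos DB item. It is a minimal example under stated assumptions: the `tweet_id` partition key field and the helper name `to_cosmos_document` are hypothetical, and the actual write (for instance via the `azure-cosmos` SDK's `upsert_item`) is left out so the code runs without an Azure account.

```python
import json
import uuid


def to_cosmos_document(message: str, partition_key_field: str = "tweet_id") -> dict:
    """Turn a JSON message into a dict shaped for a Cosmos DB container.

    Assumes the container is partitioned on `/tweet_id`; that path is hypothetical.
    """
    document = json.loads(message)
    if partition_key_field not in document:
        raise ValueError(f"message is missing partition key field {partition_key_field!r}")
    # Items carry a string `id`; keep an existing one, otherwise generate a UUID.
    if "id" in document:
        document["id"] = str(document["id"])
    else:
        document["id"] = str(uuid.uuid4())
    return document


if __name__ == "__main__":
    doc = to_cosmos_document('{"tweet_id": "1", "text": "Stay safe Houston"}')
    print(sorted(doc))
```

Rejecting messages that lack the partition key field early avoids a failed write later in the pipeline.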

Visualization

  • Power BI

    Power BI serves the report built on the data this pipeline ingests and processes.

Author: 👤 Kristijan Bakaric

Show your support

Give a ⭐️ if this project helped you!
