Earthquakes API Live Data Project - Data Engineering Zoomcamp Project

Data Engineering | Zoomcamp Course Project


Problem statement

The project builds an end-to-end data pipeline that periodically (hourly) extracts earthquake data from the USGS API. The extracted data is processed and enriched with a new geo column (city), parsed out of the existing long-address column (place), and then loaded into the tables the dashboard uses to generate analytics.
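The enrichment itself happens in the repo's Spark transformation step; as a hedged illustration of the idea only (assuming USGS place strings such as "10km NE of Ridgecrest, CA"), the city could be parsed out of place roughly like this:

```python
# Hypothetical sketch of the city enrichment, not the repo's actual code.
# Assumes USGS "place" strings such as "10km NE of Ridgecrest, CA".
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("earthquake-enrichment").getOrCreate()

df = spark.createDataFrame(
    [("10km NE of Ridgecrest, CA",), ("South of the Fiji Islands",)],
    ["place"],
)

# Take the text after the last " of ", then the part before the first comma,
# as a rough approximation of the nearest city / locality.
df = df.withColumn(
    "city",
    F.trim(F.split(F.element_at(F.split(F.col("place"), " of "), -1), ",")[0]),
)
df.show(truncate=False)  # city -> "Ridgecrest", "the Fiji Islands"
```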

There are two pipelines (DAGs):

  • Hourly_DAG: runs every hour to extract new data, starting from the installation time (see the extraction sketch after this list).
  • Historical_DAG: runs once to extract the historical earthquake data (2020, 2021, 2022, and 2023 up to today).
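For orientation, the sketch below shows what an hourly pull from the public USGS FDSN event service looks like. The endpoint and the format/starttime/endtime parameters are the documented USGS ones; the window logic and function names are illustrative assumptions, not the repo's actual DAG code.

```python
# Illustrative sketch only: fetch one hourly window from the USGS FDSN event API.
import io
from datetime import datetime, timedelta, timezone

import pandas as pd
import requests

USGS_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

def fetch_window(start: datetime, end: datetime) -> pd.DataFrame:
    params = {
        "format": "csv",  # the CSV columns match the data schema listed below
        "starttime": start.strftime("%Y-%m-%dT%H:%M:%S"),
        "endtime": end.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    resp = requests.get(USGS_URL, params=params, timeout=60)
    resp.raise_for_status()
    return pd.read_csv(io.StringIO(resp.text))

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    df = fetch_window(now - timedelta(hours=1), now)
    print(f"{len(df)} earthquakes in the last hour")
```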

The dashboard has three parts, with control filters on time and city, covering the analytics points below:

  • Historical data analytics:
    • Earthquake trend over time
    • Earthquake counts per city
    • Most intense earthquakes (mag)
  • Spatial data analytics:
    • World map with earthquake geolocations
    • Heat map showing earthquake magnitudes (mag)
  • Last 24 hours analytics:
    • Earthquake trend over time
    • Earthquake counts per city
    • Most intense earthquakes (mag)

To speed up queries and data processing, the final table full_data is partitioned by earthquake date (the time column), since time is one of the dashboard's filter controls and one dashboard section reads only the latest partition (where the date equals today). The table is also clustered by geodata (the city column), which is a dashboard filter control as well. The original time column is cast from string to a timestamp type during the Spark transformation steps so the table can be partitioned by time.
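As a minimal sketch of that load step, assuming the spark-bigquery connector and placeholder bucket/table names (the repo's actual job may differ):

```python
# Minimal sketch, not the repo's exact job: cast `time` to a timestamp and write a
# BigQuery table partitioned on `time` and clustered on the already-derived `city` column.
# Bucket, project, and dataset names below are placeholders.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("earthquakes-load").getOrCreate()

df = spark.read.parquet("gs://YOUR_BUCKET/earthquakes/enriched/")  # placeholder path

df = df.withColumn("time", F.to_timestamp("time"))  # string -> timestamp so we can partition

(
    df.write.format("bigquery")
    .option("temporaryGcsBucket", "YOUR_TEMP_BUCKET")  # placeholder
    .option("partitionField", "time")                  # daily partitions on event time
    .option("partitionType", "DAY")
    .option("clusteredFields", "city")                 # cluster on the derived city column
    .mode("overwrite")
    .save("YOUR_PROJECT.earthquakes.full_data")        # placeholder table id
)
```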


Data schema

Column            Type
time              TimestampType
latitude          FloatType
longitude         FloatType
depth             FloatType
mag               FloatType
magType           StringType
nst               FloatType
gap               FloatType
dmin              FloatType
rms               FloatType
net               StringType
id                StringType
updated           TimestampType
place             StringType
type              FloatType
horizontalError   FloatType
depthError        FloatType
magError          FloatType
magNst            FloatType
status            StringType
locationSource    StringType
magSource         StringType

Reference
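If you need this schema in code, a PySpark StructType equivalent of the table above would look roughly like the sketch below (the repo's own definition may differ in details):

```python
# PySpark equivalent of the schema table above (a sketch; the repo may define it differently).
from pyspark.sql.types import (
    StructType, StructField, TimestampType, FloatType, StringType,
)

earthquake_schema = StructType([
    StructField("time", TimestampType(), True),
    StructField("latitude", FloatType(), True),
    StructField("longitude", FloatType(), True),
    StructField("depth", FloatType(), True),
    StructField("mag", FloatType(), True),
    StructField("magType", StringType(), True),
    StructField("nst", FloatType(), True),
    StructField("gap", FloatType(), True),
    StructField("dmin", FloatType(), True),
    StructField("rms", FloatType(), True),
    StructField("net", StringType(), True),
    StructField("id", StringType(), True),
    StructField("updated", TimestampType(), True),
    StructField("place", StringType(), True),
    StructField("type", FloatType(), True),  # FloatType as listed above
    StructField("horizontalError", FloatType(), True),
    StructField("depthError", FloatType(), True),
    StructField("magError", FloatType(), True),
    StructField("magNst", FloatType(), True),
    StructField("status", StringType(), True),
    StructField("locationSource", StringType(), True),
    StructField("magSource", StringType(), True),
])
```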

Data Pipeline

  • Full pipeline (diagram image)

  • Hourly_DAG (diagram image)

  • Historical_DAG (diagram image; see the backfill sketch after this list)
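A note on the historical backfill: the USGS service limits a single query to 20,000 events, so a 2020-to-today pull is usually chunked into smaller windows. The sketch below (illustrative names, not the repo's DAG code) chunks the range by month:

```python
# Hedged sketch of the historical backfill idea, chunked into monthly windows.
import io
from datetime import date

import pandas as pd
import requests

USGS_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

def month_starts(start: date, end: date):
    """Yield the first day of each month from start through end's month."""
    current = date(start.year, start.month, 1)
    while current <= end:
        yield current
        current = (date(current.year + 1, 1, 1) if current.month == 12
                   else date(current.year, current.month + 1, 1))

def backfill(start: date, end: date) -> pd.DataFrame:
    bounds = list(month_starts(start, end)) + [end]
    frames = []
    for window_start, window_end in zip(bounds, bounds[1:]):
        resp = requests.get(
            USGS_URL,
            params={
                "format": "csv",
                "starttime": window_start.isoformat(),
                "endtime": window_end.isoformat(),
            },
            timeout=120,
        )
        resp.raise_for_status()
        frames.append(pd.read_csv(io.StringIO(resp.text)))
    return pd.concat(frames, ignore_index=True)

if __name__ == "__main__":
    df = backfill(date(2020, 1, 1), date.today())
    print(f"{len(df)} earthquakes since 2020")
```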

Technologies and Tools

  • Google Cloud Platform (Compute Engine, Cloud Storage, BigQuery)
  • Terraform (infrastructure as code)
  • Docker + docker-compose
  • Apache Airflow (orchestration)
  • Apache Spark (transformations)
  • Anaconda (Python environment on the VM)

Analytics Dashboard

The dashboard has three parts, with control filters on time and city, covering the analytics points below:

  • Historical data analytics (screenshot image):

    • Earthquake trend over time
    • Earthquake counts per city
    • Most intense earthquakes (mag)
  • Spatial data analytics (screenshot image):

    • World map with earthquake geolocations
    • Heat map showing earthquake magnitudes (mag)
  • Last 24 hours analytics (screenshot image):

    • Earthquake trend over time
    • Earthquake counts per city
    • Most intense earthquakes (mag)

You can check the live dashboard here. (The "last 24 hours" part of the dashboard may have no data if the pipeline is not running live, so in that case filter on a date from the historical data.)
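For reference, the "last 24 hours" panel boils down to a query against the latest time partition. A hedged example using the google-cloud-bigquery Python client, with placeholder project/dataset names rather than the repo's actual ones:

```python
# Hedged example of the kind of query behind the "last 24 hours" panel.
# The table id is a placeholder; the dashboard itself queries BigQuery directly.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials/project

query = """
    SELECT city, COUNT(*) AS quake_count, MAX(mag) AS max_mag
    FROM `YOUR_PROJECT.earthquakes.full_data`
    WHERE DATE(time) = CURRENT_DATE()   -- prunes to today's partition
    GROUP BY city
    ORDER BY quake_count DESC
    LIMIT 20
"""

for row in client.query(query).result():
    print(row.city, row.quake_count, row.max_mag)
```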

Setup

  1. Set up your Google Cloud project and service account (step 1)
  2. Install Terraform on your local machine (step 2)
  3. Set up Terraform to create the pipeline's required infrastructure (step 3)
  4. SSH into your Google Compute Engine VM (step 4)
  5. Clone the repo to your Google Compute Engine VM
    git clone https://github.com/AliaHa3/data-engineering-zoomcamp-project.git
  6. Set up Anaconda + Docker + docker-compose
    cd data-engineering-zoomcamp-project
    bash scripts/vm_setup.sh
  7. Update the environment variables in the file below with your specific project_id and bucket names
    cat data-engineering-zoomcamp-project/scripts/setup_config.sh
  8. Set up the pipeline Docker image (Airflow + Spark)
    cd data-engineering-zoomcamp-project
    bash scripts/airflow_startup.sh
  9. In Visual Studio Code, open the Ports tab and forward port 8080 (ForwardPort screenshot)

Go to localhost:8080 and log in with airflow:airflow as the credentials (AirflowLogin screenshot).

  10. Enable the historical_DAG and you should see it run; it takes 10-15 minutes to finish.
  11. Enable the hourly_DAG.
  12. You can check your data in the BigQuery tables (see the sanity-check sketch after this list).
  13. If you want to stop the Docker containers, run the commands below
    cd data-engineering-zoomcamp-project
    bash scripts/airflow_stop.sh
    or, to delete and clean up all Docker-image-related files:
    cd data-engineering-zoomcamp-project
    bash scripts/airflow_clear.sh
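As a quick sanity check that the DAGs actually loaded rows (a hedged sketch; the table id is a placeholder for your own project and dataset):

```python
# Hedged sanity check: confirm the final table exists and has rows.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("YOUR_PROJECT.earthquakes.full_data")  # placeholder table id
print(f"{table.num_rows} rows in {table.full_table_id}")
```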

References

  • DataTalks Club
  • Data Engineering Zoomcamp
  • MichaelShoemaker's setup steps