Data Engineering | Zoomcamp Course Project
Problem statement
The project builds an end-to-end data pipeline that periodically (hourly) extracts earthquake data from the USGS API. The extracted data is processed and enriched with a new geo column (city), derived from an existing column that holds a long address (place), and the desired tables are then created for the dashboard to generate analytics.
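As a rough illustration, the extraction and enrichment step could look like the sketch below (a minimal sketch using the public USGS FDSN CSV endpoint; the query window and the city-parsing heuristic are illustrative assumptions, not necessarily the repo's exact logic):

```python
import requests
import pandas as pd
from io import StringIO

# Public USGS FDSN event API; returns one CSV row per earthquake.
USGS_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

def extract_earthquakes(start: str, end: str) -> pd.DataFrame:
    """Fetch earthquake events in [start, end) as a DataFrame."""
    params = {"format": "csv", "starttime": start, "endtime": end}
    resp = requests.get(USGS_URL, params=params, timeout=60)
    resp.raise_for_status()
    return pd.read_csv(StringIO(resp.text))

def add_city(df: pd.DataFrame) -> pd.DataFrame:
    # 'place' typically looks like "10 km ESE of Ridgecrest, CA".
    # Rough heuristic (an assumption): keep the text after "of " and
    # drop the region code after the comma.
    df["city"] = (
        df["place"].astype(str)
        .str.split(" of ").str[-1]
        .str.split(",").str[0]
        .str.strip()
    )
    return df

if __name__ == "__main__":
    df = add_city(extract_earthquakes("2023-01-01", "2023-01-02"))
    print(df[["time", "place", "city", "mag"]].head())
```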
There will be two running pipelines (DAGs):
- Hourly_DAG: runs every hour to extract new data, starting from the installation time.
- Historical_DAG: runs once to extract the historical earthquake data (2020, 2021, 2022, and 2023 up to today).
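Scheduling-wise, the two DAGs could be declared roughly as follows (a minimal Airflow sketch; the dag_id values, task names, and callables are illustrative assumptions rather than the repo's actual definitions):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables -- the real DAGs chain extraction, Spark
# transformation, GCS upload, and BigQuery load as separate tasks.
def extract_hourly(**context): ...
def extract_history(**context): ...

with DAG(
    dag_id="hourly_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",  # new data every hour
    catchup=False,                # only from installation time onward
):
    PythonOperator(task_id="extract_hourly", python_callable=extract_hourly)

with DAG(
    dag_id="historical_dag",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@once",    # single backfill over 2020..today
):
    PythonOperator(task_id="extract_history", python_callable=extract_history)
```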
The dashboard will have three parts, with control filters on time and city, that demonstrate the analytics points below:
- Historical data analytics:
  - Earthquake trend over time
  - Earthquake counts per city
  - Maximum earthquake intensity (mag)
- Spatial data analytics:
  - World map with earthquake geolocations
  - World heat map showing earthquake magnitude (intensity)
- Last 24 hours analytics:
  - Earthquake trend over time
  - Earthquake counts per city
  - Maximum earthquake intensity (mag)
To accelerate queries and data processing, the final table "full_data" is partitioned by earthquake date (column 'time'), since this column is one of the filter controls in the dashboard and one dashboard section reads only the latest date partition (where the date equals today). The table is also clustered by geodata (column 'city'), which is a dashboard filter control as well. The original 'time' column is transformed from string to a date/timestamp type in the Spark transformation steps so that the table can be partitioned by time.
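The cast and the partitioned, clustered write could look like this (a minimal PySpark sketch assuming the spark-bigquery connector; the bucket, project, and dataset names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.appName("earthquakes").getOrCreate()

# Hypothetical input path in the data lake.
df = spark.read.parquet("gs://YOUR_BUCKET/earthquakes/raw/")

# 'time' arrives as a string; cast it so BigQuery can partition by day.
df = df.withColumn("time", to_timestamp(col("time")))

(
    df.write.format("bigquery")
    .option("temporaryGcsBucket", "YOUR_TEMP_BUCKET")  # for the indirect write
    .option("partitionField", "time")    # day partitions on DATE(time)
    .option("partitionType", "DAY")
    .option("clusteredFields", "city")   # cluster on the geo column
    .mode("overwrite")
    .save("your_project.your_dataset.full_data")
)
```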
Data schema
Column | Type |
---|---|
time | TimestampType |
latitude | FloatType |
longitude | FloatType |
depth | FloatType |
mag | FloatType |
magType | StringType |
nst | FloatType |
gap | FloatType |
dmin | FloatType |
rms | FloatType |
net | StringType |
id | StringType |
updated | TimestampType |
place | StringType |
type | StringType |
horizontalError | FloatType |
depthError | FloatType |
magError | FloatType |
magNst | FloatType |
status | StringType |
locationSource | StringType |
magSource | StringType |
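In PySpark the same schema can be declared explicitly (a direct transcription of the table above; field nullability is an assumption):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, FloatType, TimestampType,
)

# Schema for the raw USGS earthquake data, matching the table above.
earthquake_schema = StructType([
    StructField("time", TimestampType(), True),
    StructField("latitude", FloatType(), True),
    StructField("longitude", FloatType(), True),
    StructField("depth", FloatType(), True),
    StructField("mag", FloatType(), True),
    StructField("magType", StringType(), True),
    StructField("nst", FloatType(), True),
    StructField("gap", FloatType(), True),
    StructField("dmin", FloatType(), True),
    StructField("rms", FloatType(), True),
    StructField("net", StringType(), True),
    StructField("id", StringType(), True),
    StructField("updated", TimestampType(), True),
    StructField("place", StringType(), True),
    StructField("type", StringType(), True),
    StructField("horizontalError", FloatType(), True),
    StructField("depthError", FloatType(), True),
    StructField("magError", FloatType(), True),
    StructField("magNst", FloatType(), True),
    StructField("status", StringType(), True),
    StructField("locationSource", StringType(), True),
    StructField("magSource", StringType(), True),
])
```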
Data Pipeline
Technologies and Tools
- Cloud - Google Cloud Platform
- Infrastructure as Code software (IaC) - Terraform
- Containerization - Docker, Docker Compose
- Workflow Orchestration - Airflow
- Batch processing - Apache Spark, PySpark
- Data Lake - Google Cloud Storage
- Data Warehouse - BigQuery
- Data Visualization - Looker Studio (Google Data Studio)
- Language - Python
Analytics Dashboard
The dashboard has three parts with control filters on time and city, covering the analytics points described above:
- Historical data analytics
- Spatial data analytics
- Last 24 hours analytics
You can check the live dashboard here. (The last-24-hours part of the dashboard may not have data if the pipeline is not running live, so please filter on a date from the historical data.)
Setup
- Set up your Google Cloud project and service account step1
- Install Terraform on your local machine step2
- Set up Terraform to create the pipeline's required infrastructure step3
- SSH to your Google Compute Engine VM step4
- Clone the repo to your Google Compute Engine VM
git clone https://github.com/AliaHa3/data-engineering-zoomcamp-project.git
- Set up Anaconda + Docker + Docker Compose
cd data-engineering-zoomcamp-project
bash scripts/vm_setup.sh
- Update the environment variables in the file below with your specific project_id and bucket names
cat data-engineering-zoomcamp-project/scripts/setup_config.sh
- Set up the pipeline Docker image (Airflow + Spark)
cd data-engineering-zoomcamp-project
bash scripts/airflow_startup.sh
- In Visual Studio Code, click on Ports and forward port 8080, then go to localhost:8080 and log in with airflow:airflow as the credentials
- Enable the historical_DAG and you should see it run. It takes 10-15 minutes to finish.
- Enable the hourly_DAG
- You can check your data in the BigQuery tables.
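For a quick sanity check from Python, something like the following works (a sketch using the google-cloud-bigquery client; the project, dataset, and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up your service-account credentials

# Hypothetical fully-qualified table name -- replace with your own.
query = """
    SELECT DATE(time) AS day, COUNT(*) AS events, MAX(mag) AS max_mag
    FROM `your_project.your_dataset.full_data`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
"""
for row in client.query(query).result():
    print(row.day, row.events, row.max_mag)
```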
- If you want to stop the Docker containers, run:
cd data-engineering-zoomcamp-project
bash scripts/airflow_stop.sh
- To delete and clean up all Docker-image-related files, run:
cd data-engineering-zoomcamp-project
bash scripts/airflow_clear.sh
Reference
DataTalks Club
Data Engineering Zoomcamp
MichaelShoemaker's setup steps