Data Engineering | Zoomcamp Course Project
Problem statement
The project builds an end-to-end data pipeline that periodically (hourly) extracts earthquake data from the USGS API. The extracted data is processed and enriched with a new geo column (city), derived from an existing column that holds a long address (place), and the desired tables are then created for the dashboard to generate analytics.
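As a rough illustration, the extraction and enrichment step could look like the sketch below (a minimal sketch using the public USGS FDSN CSV endpoint; the query window and the city-parsing heuristic are illustrative assumptions, not necessarily the repo's exact logic):

```python
import requests
import pandas as pd
from io import StringIO

# Public USGS FDSN event API; returns one CSV row per earthquake.
USGS_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

def extract_earthquakes(start: str, end: str) -> pd.DataFrame:
    """Fetch earthquake events in [start, end) as a DataFrame."""
    params = {"format": "csv", "starttime": start, "endtime": end}
    resp = requests.get(USGS_URL, params=params, timeout=60)
    resp.raise_for_status()
    return pd.read_csv(StringIO(resp.text))

def add_city(df: pd.DataFrame) -> pd.DataFrame:
    # 'place' typically looks like "10 km ESE of Ridgecrest, CA".
    # Rough heuristic (an assumption): keep the text after "of " and
    # drop the region code after the comma.
    df["city"] = (
        df["place"].astype(str)
        .str.split(" of ").str[-1]
        .str.split(",").str[0]
        .str.strip()
    )
    return df

if __name__ == "__main__":
    df = add_city(extract_earthquakes("2023-01-01", "2023-01-02"))
    print(df[["time", "place", "city", "mag"]].head())
```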
There will be two running pipelines (DAGs):
- Hourly_DAG: runs every hour to extract new data, starting from the installation time.
- Historical_DAG: runs once to extract the historical earthquake data (2020, 2021, 2022, and 2023 up to today).
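Scheduling-wise, the two DAGs could be declared roughly as follows (a minimal Airflow sketch; the dag_id values, task names, and callables are illustrative assumptions rather than the repo's actual definitions):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables -- the real DAGs chain extraction, Spark
# transformation, GCS upload, and BigQuery load as separate tasks.
def extract_hourly(**context): ...
def extract_history(**context): ...

with DAG(
    dag_id="hourly_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",  # new data every hour
    catchup=False,                # only from installation time onward
):
    PythonOperator(task_id="extract_hourly", python_callable=extract_hourly)

with DAG(
    dag_id="historical_dag",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@once",    # single backfill over 2020..today
):
    PythonOperator(task_id="extract_history", python_callable=extract_history)
```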
The dashboard will have three parts, with control filters on time and city, that demonstrate the analytics points below:
- Historical data analytics:
  - Earthquake trend over time
  - Earthquake counts per city
  - Maximum earthquake intensity (mag)
- Spatial data analytics:
  - World map with earthquake geolocations
  - World heat map showing earthquake magnitude (intensity)
- Last 24 hours analytics:
  - Earthquake trend over time
  - Earthquake counts per city
  - Maximum earthquake intensity (mag)
To accelerate queries and data processing, the final table "full_data" is partitioned by earthquake date (column 'time'), since this column is one of the filter controls in the dashboard and one dashboard section reads only the latest date partition (where the date equals today). The table is also clustered by geodata (column 'city'), which is a dashboard filter control as well. The original 'time' column is transformed from string to a date/timestamp type in the Spark transformation steps so that the table can be partitioned by time.
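The cast and the partitioned, clustered write could look like this (a minimal PySpark sketch assuming the spark-bigquery connector; the bucket, project, and dataset names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp

spark = SparkSession.builder.appName("earthquakes").getOrCreate()

# Hypothetical input path in the data lake.
df = spark.read.parquet("gs://YOUR_BUCKET/earthquakes/raw/")

# 'time' arrives as a string; cast it so BigQuery can partition by day.
df = df.withColumn("time", to_timestamp(col("time")))

(
    df.write.format("bigquery")
    .option("temporaryGcsBucket", "YOUR_TEMP_BUCKET")  # for the indirect write
    .option("partitionField", "time")    # day partitions on DATE(time)
    .option("partitionType", "DAY")
    .option("clusteredFields", "city")   # cluster on the geo column
    .mode("overwrite")
    .save("your_project.your_dataset.full_data")
)
```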
Data schema
Column | Type |
---|---|
time | TimestampType |
latitude | FloatType |
longitude | FloatType |
depth | FloatType |
mag | FloatType |
magType | StringType |
nst | FloatType |
gap | FloatType |
dmin | FloatType |
rms | FloatType |
net | StringType |
id | StringType |
updated | TimestampType |
place | StringType |
type | StringType |
horizontalError | FloatType |
depthError | FloatType |
magError | FloatType |
magNst | FloatType |
status | StringType |
locationSource | StringType |
magSource | StringType |
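In PySpark the same schema can be declared explicitly (a direct transcription of the table above; field nullability is an assumption):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, FloatType, TimestampType,
)

# Schema for the raw USGS earthquake data, matching the table above.
earthquake_schema = StructType([
    StructField("time", TimestampType(), True),
    StructField("latitude", FloatType(), True),
    StructField("longitude", FloatType(), True),
    StructField("depth", FloatType(), True),
    StructField("mag", FloatType(), True),
    StructField("magType", StringType(), True),
    StructField("nst", FloatType(), True),
    StructField("gap", FloatType(), True),
    StructField("dmin", FloatType(), True),
    StructField("rms", FloatType(), True),
    StructField("net", StringType(), True),
    StructField("id", StringType(), True),
    StructField("updated", TimestampType(), True),
    StructField("place", StringType(), True),
    StructField("type", StringType(), True),
    StructField("horizontalError", FloatType(), True),
    StructField("depthError", FloatType(), True),
    StructField("magError", FloatType(), True),
    StructField("magNst", FloatType(), True),
    StructField("status", StringType(), True),
    StructField("locationSource", StringType(), True),
    StructField("magSource", StringType(), True),
])
```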
Data Pipeline
Technologies and Tools
- Cloud - Google Cloud Platform
- Infrastructure as Code software (IaC) - Terraform
- Containerization - Docker, Docker Compose
- Workflow Orchestration - Airflow
- Batch processing - Apache Spark, PySpark
- Data Lake - Google Cloud Storage
- Data Warehouse - BigQuery
- Data Visualization - Looker Studio (Google Data Studio)
- Language - Python
Analytics Dashboard
The dashboard has three parts with control filters on time and city, covering the analytics points described above:
- Historical data analytics
- Spatial data analytics
- Last 24 hours analytics
You can check the live dashboard here. (The last-24-hours part of the dashboard may not have data if the pipeline is not running live, so please filter on a date from the historical data.)
Setup
- Set up your Google Cloud project and service account step1
- Install Terraform on your local machine step2
- Set up Terraform to create the pipeline's required infrastructure step3
- SSH to your Google Compute Engine VM step4
- Clone the repo to your Google Compute Engine VM
git clone https://github.com/AliaHa3/data-engineering-zoomcamp-project.git
- Set up Anaconda + Docker + Docker Compose
cd data-engineering-zoomcamp-project
bash scripts/vm_setup.sh
- Update the environment variables in the file below with your specific project_id and bucket names
cat data-engineering-zoomcamp-project/scripts/setup_config.sh
- Set up the pipeline Docker image (Airflow + Spark)
cd data-engineering-zoomcamp-project
bash scripts/airflow_startup.sh
- In Visual Studio Code, click on Ports and forward port 8080, then go to localhost:8080 and log in with airflow:airflow as the credentials
- Enable the historical_DAG and you should see it run. It takes 10-15 minutes to finish.
- Enable the hourly_DAG
- You can check your data in the BigQuery tables.
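For a quick sanity check from Python, something like the following works (a sketch using the google-cloud-bigquery client; the project, dataset, and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up your service-account credentials

# Hypothetical fully-qualified table name -- replace with your own.
query = """
    SELECT DATE(time) AS day, COUNT(*) AS events, MAX(mag) AS max_mag
    FROM `your_project.your_dataset.full_data`
    GROUP BY day
    ORDER BY day DESC
    LIMIT 7
"""
for row in client.query(query).result():
    print(row.day, row.events, row.max_mag)
```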
- If you want to stop the Docker containers, run:
cd data-engineering-zoomcamp-project
bash scripts/airflow_stop.sh
- To delete and clean up all Docker-image-related files, run:
cd data-engineering-zoomcamp-project
bash scripts/airflow_clear.sh
Reference
DataTalks Club
Data Engineering Zoomcamp
MichaelShoemaker's setup steps