Earthquake Data Engineering Capstone Project

This project automatically ingests, stores, transforms the latest seismic activity data from the USGS (United States Geologic Society) for later analysis.

This GitHub repository fulfills the final capstone project for the Data Engineering Zoomcamp by DataTalks.Club.

Dashboard

Earthquake Capstone Project Google Looker Studio

Technical Challenge

Analysts are studying the frequency, intensity, and spatial occurrence of seismic activity. This data needs to be automatically stored and processed in a way that analysts can quickly analyze and build out reports and dashboards. The technical implementation needs to be:

Automated
Reliable
Scalable
Affordable

Technology Utilized

Infrastructure as code (IaC): Terraform
Workflow orchestration: Airflow
Containerization: Docker
Data Lake: Google Cloud Storage (GCS)
Data Warehouse: BigQuery
Transformations: dbt
Visualization: Google Data Studio / Looker Studio

Dataset

The data comes from access of a public REST API from the USGS in collaboration with the International Federation of Digital Seismograph Networks (FDSN). The data is accessed through the query API endpoint, which has 3 parameters:

format={geojson}
starttime={yyyy-mm-dd}
endtime={yyyy-mm-dd}

https://earthquake.usgs.gov/fdsnws/event/1/query?format={format}&starttime={yyyy-mm-dd}&endtime={yyyy-mm-dd}

The geojson data format was used in this project.

Example of json response:

Data Transformation

The REST API json response comes as a nested json structure.
The seismic events that are of the most interest are in the "features" array of objects that needs to be iterated over and extracted into a Python array.
This array was turned into a Pandas dataframe.
Additional columns were created that converted the UNIX or POSIX time (ms) into datetime objects.
Example of Flattened data

Data Modeling

The data was denormalized and modeled using the One Big Table(OBT) method.
This allows for no joins of the data for the analyst and typically faster query performance for data warehouses.
Additional storage costs are typically a non-issue.

Data Storage

Both the raw json's and flattened parquet files are stored in a datalake in Google Cloud Storage bucket(GCS)

Airflow Orchestration

The DAG does the following on a '@daily' schedule:

Parameterizes the API endpoint to use the dates passed in by using the Airflow provided template variables
Uses the Python requests library GET request and returns an API response to be stored in local memory with a variable.
Saves the raw json file to local storage, uploads the blob into the data lake, then deletes the local json file.
Uses Python to parse the nested JSON into list of dictionaries that gets transformed into a Pandas dataframe.
Converts the DataFrame into a parquet file that gets saved locally.
Uploads the parquet into my Google Cloud Storage data lake with the parameterized date as filename.
Deletes the locally stored parquet file.
Pushes the data lake parquet into a BigQuery Native table using the 'LOAD DATA INTO...' command.
Runs the dbt models using 'dbt_run' command with the BashOperator.

Airflow DAG

dbt Transformation

dbt was used to take the raw_earthquakes data from BigQuery native table and deduplicate the data using a SQL window function.
The goal is to have only one record of seismic event using the seismic 'id' column.
Requirements indicate that only the latest version of the event based on its 'properties_updated_datetime' timestamp are needed.
It is assumed as the seismic data is reprocessed, this architecture design will allow analysts to examine the historical changes of a particular seismic event, yet still show the latest version of each event.

with earthquakes as 
(
    select *,
        row_number() over(partition by id, properties_updated_datetime) as rn
    from {{ source('raw','raw_earthquakes') }}
    where id is not null 
)
select ...
from earthquakes 
where rn = 1

The stg_earthquakes was materialized as view.
The cast function was used to rename columns and change data types.
This view is the source reference for the fact_earthquakes table that is partitioned by "properties_time_datetime" timestamp and materialized as an incremental table.

{{ config(
    materialized='incremental',
    partition_by={
      "field": "properties_updated_datetime",
      "data_type": "timestamp",
      "granularity": "day"
    }
)}}

dbt Lineage Graph

dbt has a built in tool used to visually show data lineage as it travels across the data layers: raw -> stage -> final

Visualization

With the fact_table materialized in a partitioned BigQuery native table, the data can now be viewed in Google Looker Studio (formerly Data Studio)
Looker Studio has 3 steps:

Pick a data source, including 100+ data connectors including BigQuery.
Pick the columns needed, or create your own using logical functions.
Create the dashboard.

Future Work That Could Be Done

Testing: dbt testing, airflow testing
More experimentation with the Airflow configuration and VM instance. I would like to attempt a lightweight version using the sequential executor instead of the celery executor and SQLite over Postgres backend.
I would like to try more automation around the devops implimentation of the entire pipeline. I would like to see if I could automate more of the dockerfile to execute more steps of the initialization.
Explore making the airflow instance less brittle. Not using any local compute, but rather push more compute to external cloud functions.
Build out more reports, push out weekly updates via slack webhook or email
Integrate Population Data to show earthquakes potential impact to human inhabitants.
Data Science, specifically time series analysis

Clean Up

A GCP trial account was used to work on this project with 90 day access. The final step was to transfer the final fact_earthquakes BiqQuery table to my personal account for safe keeping. The BigQuery Copy Command was utilized.

bq cp -f \
old_dataset.fact_earthquakes \
main_projectid:new_dataset.fact_earthquakes

The final dashboard was conected to this BigQuery Table.

Terraform

Lastly, the cloud resources were archived and exported to Terraform HCL code using GCP gcloud beta resource-config bulk-export.

Acknowledgements

Thanks to the instructors.

And my employer and teammates.

evolv Consulting

My Linkedin

ANelson82/de_zoomcamp_2022_earthquake_capstone