This project automatically ingests, stores, transforms the latest seismic activity data from the USGS (United States Geologic Society) for later analysis.
This GitHub repository fulfills the final capstone project for the Data Engineering Zoomcamp by DataTalks.Club.
Earthquake Capstone Project Google Looker Studio
Analysts are studying the frequency, intensity, and spatial occurrence of seismic activity. This data needs to be automatically stored and processed in a way that analysts can quickly analyze and build out reports and dashboards. The technical implementation needs to be:
- Automated
- Reliable
- Scalable
- Affordable
- Infrastructure as code (IaC): Terraform
- Workflow orchestration: Airflow
- Containerization: Docker
- Data Lake: Google Cloud Storage (GCS)
- Data Warehouse: BigQuery
- Transformations: dbt
- Visualization: Google Data Studio / Looker Studio
The data comes from access of a public REST API from the USGS in collaboration with the International Federation of Digital Seismograph Networks (FDSN). The data is accessed through the query API endpoint, which has 3 parameters:
- format={geojson}
- starttime={yyyy-mm-dd}
- endtime={yyyy-mm-dd}
The geojson data format was used in this project.
-
The REST API json response comes as a nested json structure.
-
The seismic events that are of the most interest are in the "features" array of objects that needs to be iterated over and extracted into a Python array.
-
This array was turned into a Pandas dataframe.
-
Additional columns were created that converted the UNIX or POSIX time (ms) into datetime objects.
- The data was denormalized and modeled using the One Big Table(OBT) method.
- This allows for no joins of the data for the analyst and typically faster query performance for data warehouses.
- Additional storage costs are typically a non-issue.
- Both the raw json's and flattened parquet files are stored in a datalake in Google Cloud Storage bucket(GCS)
The DAG does the following on a '@daily' schedule:
- Parameterizes the API endpoint to use the dates passed in by using the Airflow provided template variables
- Uses the Python requests library GET request and returns an API response to be stored in local memory with a variable.
- Saves the raw json file to local storage, uploads the blob into the data lake, then deletes the local json file.
- Uses Python to parse the nested JSON into list of dictionaries that gets transformed into a Pandas dataframe.
- Converts the DataFrame into a parquet file that gets saved locally.
- Uploads the parquet into my Google Cloud Storage data lake with the parameterized date as filename.
- Deletes the locally stored parquet file.
- Pushes the data lake parquet into a BigQuery Native table using the 'LOAD DATA INTO...' command.
- Runs the dbt models using 'dbt_run' command with the BashOperator.
- dbt was used to take the
raw_earthquakes
data from BigQuery native table and deduplicate the data using a SQL window function. - The goal is to have only one record of seismic event using the seismic 'id' column.
- Requirements indicate that only the latest version of the event based on its 'properties_updated_datetime' timestamp are needed.
- It is assumed as the seismic data is reprocessed, this architecture design will allow analysts to examine the historical changes of a particular seismic event, yet still show the latest version of each event.
with earthquakes as
(
select *,
row_number() over(partition by id, properties_updated_datetime) as rn
from {{ source('raw','raw_earthquakes') }}
where id is not null
)
select ...
from earthquakes
where rn = 1
- The
stg_earthquakes
was materialized as view. - The cast function was used to rename columns and change data types.
- This view is the source reference for the
fact_earthquakes
table that is partitioned by "properties_time_datetime" timestamp and materialized as an incremental table.
{{ config(
materialized='incremental',
partition_by={
"field": "properties_updated_datetime",
"data_type": "timestamp",
"granularity": "day"
}
)}}
- dbt has a built in tool used to visually show data lineage as it travels across the data layers: raw -> stage -> final
- With the fact_table materialized in a partitioned BigQuery native table, the data can now be viewed in Google Looker Studio (formerly Data Studio)
- Looker Studio has 3 steps:
- Pick a data source, including 100+ data connectors including BigQuery.
- Pick the columns needed, or create your own using logical functions.
- Create the dashboard.
- Testing: dbt testing, airflow testing
- More experimentation with the Airflow configuration and VM instance. I would like to attempt a lightweight version using the sequential executor instead of the celery executor and SQLite over Postgres backend.
- I would like to try more automation around the devops implimentation of the entire pipeline. I would like to see if I could automate more of the dockerfile to execute more steps of the initialization.
- Explore making the airflow instance less brittle. Not using any local compute, but rather push more compute to external cloud functions.
- Build out more reports, push out weekly updates via slack webhook or email
- Integrate Population Data to show earthquakes potential impact to human inhabitants.
- Data Science, specifically time series analysis
A GCP trial account was used to work on this project with 90 day access. The final step was to transfer the final fact_earthquakes
BiqQuery table to my personal account for safe keeping.
The BigQuery Copy Command was utilized.
bq cp -f \
old_dataset.fact_earthquakes \
main_projectid:new_dataset.fact_earthquakes
The final dashboard was conected to this BigQuery Table.
Lastly, the cloud resources were archived and exported to Terraform HCL code using GCP gcloud beta resource-config bulk-export.
Thanks to the instructors.
And my employer and teammates.