
Super-Pipe: EtLT Pipeline with Prefect, DuckDB, and GCP

The Cornershop EtLT pipeline processes data on 540 distinct products sourced from major supermarkets in Florianópolis. Built on Prefect, DuckDB, and GCP, the pipeline aims for efficient processing, accurate data representation, and the ability to scale with growing data volumes.

Motivation

With hundreds of products across multiple supermarkets, the project needs a robust pipeline that can extract, lightly transform, load, and then transform the data again, leaving it ready for analytics and insight generation.

Data Pipeline

[Figure: data pipeline architecture diagram]

EtLT Components:

[Figure: EtLT component map]

  • E(xtract):

    This component calls the Cornershop API directly, ensuring accurate and timely data extraction.

  • t(ransform):

    Using DuckDB, the data is normalized in preparation for the loading phase.

  • L(oad):

    Orchestrated by Prefect, this phase stores the data in Cloud Storage in a structured, optimized layout.

  • T(ransform):

    A second round of transformation is applied with DuckDB. Here, business rules are applied, turning raw data into actionable insights that can drive business decisions (a sketch of all four stages follows this list).
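
For concreteness, here is a minimal sketch of the four stages as a single Prefect flow, assuming Prefect 2.x with the prefect-gcp collection installed. The API endpoint, search terms, file paths, and block name are hypothetical placeholders, not the repository's actual code.

```python
# Minimal sketch of the four EtLT stages; endpoint, paths, and block
# names are hypothetical placeholders, not the repository's actual code.
import duckdb
import pandas as pd
import requests
from prefect import flow, task
from prefect_gcp.cloud_storage import GcsBucket


@task(retries=3, retry_delay_seconds=30)
def extract(term: str) -> list[dict]:
    # E(xtract): query the Cornershop API for one search term.
    resp = requests.get(
        "https://cornershopapp.com/api/v2/search",  # hypothetical endpoint
        params={"query": term, "locality": "Florianopolis"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]


@task
def transform(results: list[dict], path: str) -> str:
    # t(ransform): flatten the nested JSON and write Parquet via DuckDB.
    df = pd.json_normalize(results)  # noqa: F841 -- read by DuckDB below
    duckdb.sql(f"COPY (SELECT * FROM df) TO '{path}' (FORMAT PARQUET)")
    return path


@task
def load(path: str) -> None:
    # L(oad): upload the daily Parquet file to Cloud Storage.
    GcsBucket.load("super-pipe-bucket").upload_from_path(path, to_path=path)


@task
def business_transform(path: str) -> None:
    # T(ransform): apply a toy business rule -- average price per store.
    duckdb.sql(
        f"""SELECT store_name, AVG(price) AS avg_price
            FROM read_parquet('{path}')
            GROUP BY store_name"""
    ).show()


@flow
def etlt_flow(terms: list[str]) -> None:
    for term in terms:
        path = transform(extract(term), f"data/{term}.parquet")
        load(path)
        business_transform(path)
```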

GCP Integration:

  • Artifact Registry: Stores the Docker image, ensuring consistent runtime environments.
  • Cloud Run: Executes the data processing flows as serverless containers (see the block sketch after this list).
  • Compute Engine: Hosts the Prefect agent that orchestrates the entire process.
  • Cloud Storage: Central repository for the daily data, ensuring integrity and ease of access.
  • BigQuery: Serverless data warehouse where the processed data is queried for analytics.
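
Under the same assumptions (Prefect 2.x with the prefect-gcp collection), these services can be registered as Prefect blocks roughly like this; the credentials path, bucket, project, and image names are placeholders.

```python
# Sketch: registering Prefect blocks for the GCP services above.
# All names and paths below are hypothetical placeholders.
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_run import CloudRunJob
from prefect_gcp.cloud_storage import GcsBucket

credentials = GcpCredentials(service_account_file="service-account.json")
credentials.save("gcp-creds", overwrite=True)

# Cloud Storage block: daily raw data lands in this bucket.
GcsBucket(bucket="super-pipe-daily", gcp_credentials=credentials).save(
    "super-pipe-bucket", overwrite=True
)

# Cloud Run block: flows run as serverless containers built from the
# image pushed to Artifact Registry.
CloudRunJob(
    image="us-docker.pkg.dev/my-project/super-pipe/super-pipe:latest",
    region="us-central1",
    credentials=credentials,
).save("super-pipe-cloud-run", overwrite=True)
```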

Initialization via GitHub Action:

  • Artifact Registry: If no repository is present, one is created to manage image versions.
  • Docker Operations: Builds a Docker image from the Dockerfile and pushes it to the Artifact Registry.
  • VM Deployment: Deploys a VM, overwriting any existing VM with the same name. For different requirements, the action can be rerun with distinct VM/queue names.
  • Prefect Agent: Deploys a Docker container running a Prefect agent that picks up flow runs. By default, flows are configured to run as serverless containers on Google Cloud Run; resource allocation can be adjusted via the Prefect UI.
  • Prefect Blocks & Flows: Prefect blocks and flow deployments are registered automatically, streamlining initialization (a deployment sketch follows this list).
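
The auto-deployment step might look roughly like the following, again assuming a recent Prefect 2.x; the flow module, queue name, and schedule are hypothetical.

```python
# Sketch of registering a flow deployment, assuming a recent Prefect 2.x;
# the flow module, queue name, and schedule are hypothetical.
from prefect.deployments import Deployment
from prefect.server.schemas.schedules import CronSchedule
from prefect_gcp.cloud_run import CloudRunJob

from flows.etlt import etlt_flow  # hypothetical flow module

Deployment.build_from_flow(
    flow=etlt_flow,
    name="daily-cornershop",
    work_queue_name="super-pipe",  # must match the queue the agent polls
    schedule=CronSchedule(cron="0 6 * * *", timezone="America/Sao_Paulo"),
    infrastructure=CloudRunJob.load("super-pipe-cloud-run"),
).apply()
```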

Dataset

| Column | Description |
| --- | --- |
| date | Date of the collection record |
| aisle_name | Name of the aisle in the store |
| product_name | Name of the product |
| brand | Brand of the product |
| price | Price of the product in R$ |
| package | Information about the product's packaging (ml/kg/packages, etc.) |
| store_name | Name of the store (supermarket, pharmacy, etc.) |
| city | City where the store is located |
| search_term | Search term used to find the product via the API |
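
Since the second transform step runs DuckDB over exactly these columns, a toy query against one day's extract could look like this; the Parquet path and the search term are hypothetical placeholders.

```python
# Toy DuckDB query over the dataset schema; the Parquet path and
# search term are hypothetical placeholders.
import duckdb

con = duckdb.connect()
cheapest = con.execute(
    """
    SELECT product_name, brand, store_name, MIN(price) AS min_price
    FROM read_parquet('data/2023-08-01.parquet')
    WHERE search_term = 'arroz'
    GROUP BY product_name, brand, store_name
    ORDER BY min_price
    LIMIT 10
    """
).df()
print(cheapest)
```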

Data Modeling

[Figure: data model diagram]

  • product_info: Dimension table with detailed product-level attributes.
  • store_info: Dimension table with store-level attributes.
  • transaction_info: Fact table that logs each collection record, with references to both product_info and store_info (a DuckDB build sketch follows this list).
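
A sketch of deriving these three tables from the flat daily data with DuckDB is below; the file paths are hypothetical, and natural keys stand in for surrogate keys for brevity.

```python
# Sketch of the star-schema build in DuckDB; paths are hypothetical,
# and natural keys stand in for surrogate keys for brevity.
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute(
    "CREATE OR REPLACE VIEW raw AS SELECT * FROM read_parquet('data/*.parquet')"
)
con.execute(
    """CREATE OR REPLACE TABLE product_info AS
       SELECT DISTINCT product_name, brand, package, aisle_name FROM raw"""
)
con.execute(
    """CREATE OR REPLACE TABLE store_info AS
       SELECT DISTINCT store_name, city FROM raw"""
)
con.execute(
    """CREATE OR REPLACE TABLE transaction_info AS
       SELECT date, product_name, brand, store_name, price, search_term
       FROM raw"""
)
```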