The Cornershop EtLT Pipeline is designed to process data from 540 distinct products sourced from major supermarkets in Florianópolis. Leveraging the power of Prefect, DuckDB, and GCP, this pipeline ensures efficient data processing, accuracy in data representation, and scalability to handle growing data demands.
With a multitude of products from various supermarkets, a robust pipeline is needed that can extract, transform, load, and then transform the data again, ensuring it's ready for analytics and insight derivation.
- E(xtract): This component interfaces directly with the Cornershop API, ensuring accurate and timely data extraction.
- t(ransform): Using DuckDB, the data undergoes light normalization, preparing it for the subsequent loading phase.
- L(oad): Orchestrated by Prefect, this phase ensures that the data is stored in a structured and optimized manner.
- T(ransform): A second round of transformation is applied using DuckDB. Here, business rules are applied, turning the raw data into actionable insights that can drive business decisions.
- Artifact Registry: Ensures consistent runtime environments by storing the Docker image.
- Cloud Run: Entrusted with executing the data processing scripts.
- Compute Engine: Acts as the host for the Prefect agent, which orchestrates the entire process.
- Cloud Storage: Serves as the central repository for daily data, ensuring data integrity and ease of access.
- BigQuery: A serverless data warehouse solution that facilitates analytics, making sense of the processed data.
- Artifact Registry: If no repository is present, one is initialized to manage versions and changes.
- Docker Operations: Builds a Docker image from the Dockerfile and pushes it to the Artifact Registry.
- VM Deployment: Deploys a VM, overwriting any existing VM with the same name. If separate environments are needed, the action can be executed with distinct VM/queue names.
- Prefect Agent: This component deploys a Docker container equipped with a Prefect agent, enabling flow runs. By default, it configures the flows as serverless containers via Google Cloud Run, offering flexibility and scalability. Any necessary adjustments in resource allocation can be done via the Prefect UI.
- Prefect Blocks & Flows: Both blocks and flows under Prefect are auto-deployed, streamlining the initialization process.
| Column | Description |
|---|---|
| date | Date of the collection record |
| aisle_name | Name of the aisle in the store |
| product_name | Name of the product |
| brand | Brand of the product |
| price | Price of the product in R$ (Brazilian reais) |
| package | Information about the product's packaging (ml/kg/packages/etc) |
| store_name | Name of the store (supermarket, pharmacy, etc) |
| city | City where the store is located |
| search_term | Search term used to find the product via the API |
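As an illustration of the schema above, a single collection record can be modeled with the standard library. The `CollectionRecord` dataclass and its sample values are hypothetical, not part of the pipeline code:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CollectionRecord:
    # Mirrors the columns of the daily collection table.
    date: date            # date of the collection record
    aisle_name: str
    product_name: str
    brand: str
    price: float          # price in R$
    package: str          # e.g. "1kg", "350ml"
    store_name: str
    city: str
    search_term: str      # term used to query the API

# Hypothetical example row.
record = CollectionRecord(
    date=date(2023, 10, 1),
    aisle_name="Mercearia",
    product_name="Arroz Branco",
    brand="Tio João",
    price=7.99,
    package="1kg",
    store_name="SuperA",
    city="Florianópolis",
    search_term="arroz",
)
```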
- product_info: A dimension table with detailed information about products.
- store_info: A dimension table with information about the stores.
- transaction_info: A fact table that logs transaction data with references to both `product_info` and `store_info`.
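The star schema described above can be sketched in SQL. Here sqlite3 stands in for the actual warehouse (DuckDB/BigQuery), and the column names and inserted rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables.
    CREATE TABLE product_info (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT, brand TEXT, package TEXT
    );
    CREATE TABLE store_info (
        store_id INTEGER PRIMARY KEY,
        store_name TEXT, city TEXT
    );
    -- Fact table referencing both dimensions.
    CREATE TABLE transaction_info (
        date TEXT, price REAL,
        product_id INTEGER REFERENCES product_info(product_id),
        store_id INTEGER REFERENCES store_info(store_id)
    );
""")
conn.execute("INSERT INTO product_info VALUES (1, 'Arroz Branco', 'Tio João', '1kg')")
conn.execute("INSERT INTO store_info VALUES (1, 'SuperA', 'Florianópolis')")
conn.execute("INSERT INTO transaction_info VALUES ('2023-10-01', 7.99, 1, 1)")

# Join the fact table back to its two dimensions.
row = conn.execute("""
    SELECT p.product_name, s.store_name, t.price
    FROM transaction_info t
    JOIN product_info p USING (product_id)
    JOIN store_info s USING (store_id)
""").fetchone()
```

Splitting the data this way keeps the fact table narrow (one row per observation) while product and store attributes live in the dimension tables, which is what makes the analytics queries in BigQuery cheap to join and aggregate.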