In the realm of online retail, understanding customer behavior, identifying sales trends, and optimizing inventory management are critical for success. However, extracting actionable insights from vast amounts of transactional data poses significant challenges. This project aims to address these challenges by developing an end-to-end data pipeline and analytics engineering solution.
- Design and implement a robust data pipeline to ingest, process, and analyze online retail transaction data.
- Utilize analytics engineering techniques to transform raw data into actionable insights, ensuring data quality, consistency, and relevance.
- Build intuitive dashboards to visualize key performance indicators (KPIs), trends, and patterns, enabling stakeholders to make informed decisions.
- Automate infrastructure provisioning, data processing, and dashboard deployment using Infrastructure as Code (IaC) and Continuous Integration/Continuous Deployment (CI/CD) practices.
- Cloud Platform: Google Cloud Platform (GCP)
- Infrastructure as Code (IaC): Terraform
- Workflow Orchestration: Mage
- Data Warehouse: BigQuery
- Data Modeling and Transformation: dbt (Data Build Tool)
- Dashboarding: Looker Studio
The project utilizes the "Online Retail" dataset from the UCI Machine Learning Repository. This dataset contains transactional data from an online retail store based in the United Kingdom, covering transactions between 01/12/2010 and 09/12/2011.
├── data
│ ├── Online Retail.xlsx
│ └── README.md
Here are some details about the "Online Retail" dataset:
- Source: UCI Machine Learning Repository
- Description: This dataset contains transactional data from an online retail store based in the United Kingdom. The data includes customer information, product details, invoice numbers, transaction dates, and sales quantities. It covers transactions that occurred between 01/12/2010 and 09/12/2011.
- Features: The dataset includes attributes such as InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.
- Format: The dataset is distributed as an Excel file (Online Retail.xlsx).
- Size: The dataset is relatively small, with around 500,000 records.
You can find the "Online Retail" dataset on the UCI Machine Learning Repository website. Here is the direct link: Online Retail Dataset.
Terraform is an open-source infrastructure as code software tool created by HashiCorp. It allows users to define and provision data center infrastructure using a high-level configuration language known as HashiCorp Configuration Language (HCL).
├── mage-ai-Terraform-Infrastructure-As-Code
│ ├── README.md
│ ├── db.tf
│ ├── fs.tf
│ ├── load_balancer.tf
│ ├── main.tf
│ └── variables.tf
- Installation: Ensure Terraform is installed on your local machine.
- Configuration: Modify the `variables.tf` file to match your desired infrastructure configuration.
- Initialization: Run `terraform init` to initialize the working directory containing the Terraform configuration files.
- Planning: Run `terraform plan` to create an execution plan. This step is optional but recommended to verify changes before applying them.
- Execution: Run `terraform apply` to apply the changes required to reach the desired state of the configuration.
- Verification: After applying changes, verify that the infrastructure has been provisioned correctly.
- Cleanup: When done, run `terraform destroy` to destroy all the resources defined in the Terraform configuration.
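For repeatable runs, the same init/plan/apply sequence can be scripted. The snippet below is only a minimal sketch (not part of this repository) that wraps the Terraform CLI with Python's `subprocess` module; the directory path is a placeholder.

```python
import subprocess
from pathlib import Path

# Assumed location of the Terraform configuration in this repo; adjust as needed.
TF_DIR = Path("mage-ai-Terraform-Infrastructure-As-Code")

def run_terraform(*args: str) -> None:
    """Run a Terraform command inside TF_DIR and fail loudly on errors."""
    subprocess.run(["terraform", *args], cwd=TF_DIR, check=True)

if __name__ == "__main__":
    run_terraform("init")                 # initialize providers and backend
    run_terraform("plan", "-out=tfplan")  # preview the changes and save the plan
    run_terraform("apply", "tfplan")      # apply the reviewed plan
```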
- Implemented workflow orchestration using Mage to develop batch processing pipelines and manage their execution.
I created two main pipelines to handle the processing of my dataset:
- `loading_from_api_transformations_to_gcs_partitioned`: This pipeline focuses on extracting data from an API source, applying the necessary transformations and cleaning processes, and then exporting the processed data to Google Cloud Storage (GCS) in a partitioned format.
- `loading_from_google_cloud_storage_to_bigquery_table`: This pipeline is responsible for loading data from Google Cloud Storage into BigQuery tables. It includes components for data loading from GCS, transformation steps (if required), and exporting the data to BigQuery tables.
- Pipelines Directory:
└── mage-ai-Workflow-Orchestration
├── magic-zoomcamp
├── pipelines
│ └── loading_from_api_transformations_to_gcs_partitioned
├── data_loaders
│ └── load_data_from_api.py
├── transformers
│ └── transform_and_clean_api_data.py
└── data_exporters
└── export_data_to_google_cloud_stoage_partitioned.py
- Pipeline Components:
  - Data Loaders: `load_data_from_api.py` is the module responsible for extracting data from the API source.
  - Transformers: `transform_and_clean_api_data.py` applies transformations and cleaning processes to the extracted data.
  - Data Exporters: `export_data_to_google_cloud_stoage_partitioned.py` exports the processed data to Google Cloud Storage in a partitioned format.
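To illustrate the shape of these blocks, here is a minimal, hypothetical sketch of a loader and a transformer written in the style of Mage's decorator-based blocks. The URL, column names, and cleaning rules are illustrative assumptions, not the exact contents of `load_data_from_api.py` or `transform_and_clean_api_data.py`.

```python
import io

import pandas as pd
import requests

# Mage normally injects these decorators; import them when running outside Mage.
if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@data_loader
def load_data_from_api(*args, **kwargs) -> pd.DataFrame:
    # Placeholder URL: the real block would point at the Online Retail source.
    url = 'https://example.com/online_retail.csv'
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))


@transformer
def transform_and_clean(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Drop rows without a customer, remove non-positive quantities,
    # and parse the invoice timestamp for later partitioning.
    df = df.dropna(subset=['CustomerID'])
    df = df[df['Quantity'] > 0]
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
    return df
```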
- Pipelines Directory:
└── mage-ai-Workflow-Orchestration
├── magic-zoomcamp
├── pipelines
│ └── loading_from_google_cloud_storage_to_bigquery_table
├── data_loaders
│ └── load_data_from_google_cloud_storage.py
└── data_exporters
└── export_data_to_google_bigquery.py
- Pipeline Components:
  - Data Loaders: `load_data_from_google_cloud_storage.py` is the module responsible for loading data from Google Cloud Storage.
  - Data Exporters: `export_data_to_google_bigquery.py` exports the processed data to BigQuery tables.
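As a rough illustration of this load step, the sketch below reads the partitioned Parquet data from GCS with pandas and writes it to BigQuery with the `google-cloud-bigquery` client. Bucket, dataset, and table names are placeholders, and the actual Mage blocks likely use Mage's built-in I/O classes, so treat this as an equivalent plain-Python version rather than the repository's code.

```python
import pandas as pd
from google.cloud import bigquery

# Placeholder names: replace with the project's real bucket, dataset, and table.
GCS_PATH = "gs://your-bucket/online_retail_partitioned/"
TABLE_ID = "your-project.online_retail.online_retail_data"

def load_gcs_to_bigquery() -> None:
    # Reading gs:// paths requires the gcsfs and pyarrow packages.
    df = pd.read_parquet(GCS_PATH)

    client = bigquery.Client()
    job = client.load_table_from_dataframe(
        df,
        TABLE_ID,
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
    )
    job.result()  # wait for the load job to finish
    print(f"Loaded {job.output_rows} rows into {TABLE_ID}")

if __name__ == "__main__":
    load_gcs_to_bigquery()
```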
I configured triggers or schedules within Mage to orchestrate the execution of my pipeline components at predefined intervals. By setting up daily triggers, I ensured that my pipelines run periodically, keeping my data processing up-to-date and synchronized with the latest changes.
In this section, I leverage the dedicated data transformation tool dbt (Data Build Tool) to prepare my data for analytics. dbt enables me to apply transformations, ensure data quality, and maintain consistency through validation and cleansing processes. Let's delve into the details:
├── dbt-Analytics_Engineering(Data Build Tool)
│ ├── dbt_project.yml
│ ├── models
│ | ├── core
│ | │ ├── fact_online_retail_data_partitioned_clustered.sql
│ | │ └── fact_online_retail_data_unpartitioned.sql
│ | └── staging
│ | ├── schema.yml
│ | └── stg_online_retail_data.sql
│ └── tests
I've organized my transformations into dbt models within the project structure. These models serve the purpose of converting raw data from staging tables into structured formats suitable for analytics and reporting. Let's dive into the details of my transformation process:
- Staging Tables: To kick off the transformation journey, I load raw data into staging tables (`stg_online_retail_data`) sourced from various data origins. These staging tables act as holding grounds for the unprocessed data retrieved from external systems.
- Core Models: Within the `core` directory, I've crafted dbt models that take charge of refining staged data into structured, analytics-ready formats. These models typically represent fact and dimension tables, shaping the foundation for insightful analysis and reporting. They include `fact_online_retail_data_partitioned_clustered.sql` and `fact_online_retail_data_unpartitioned.sql`.
- Partitioning and Clustering: To optimize query performance and streamline data accessibility, I've tactically partitioned and clustered my tables. Notably, the fact tables are partitioned by `invoice_datetime` and clustered by `customer_id`, aligning with the typical access patterns of my analytics queries (see the sketch after this list).
- Testing: Ensuring data quality and consistency is paramount. Hence, I've implemented rigorous testing mechanisms. The `schema.yml` file houses definitions for tests designed to validate the structure and integrity of my transformed data.
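In dbt, the partitioning and clustering above are declared in the model configuration (e.g. `partition_by` / `cluster_by` in `fact_online_retail_data_partitioned_clustered.sql`). Purely as an illustration of what that produces, the sketch below creates an equivalently partitioned and clustered table directly with the BigQuery Python client; the table ID and schema are assumptions, not the project's exact definitions.

```python
from google.cloud import bigquery

# Placeholder identifier; in this project the real table is built by dbt.
TABLE_ID = "your-project.online_retail.fact_online_retail_data_partitioned_clustered"

def create_partitioned_clustered_table() -> None:
    client = bigquery.Client()

    # Illustrative schema based on the dataset's attributes.
    schema = [
        bigquery.SchemaField("invoice_no", "STRING"),
        bigquery.SchemaField("stock_code", "STRING"),
        bigquery.SchemaField("quantity", "INTEGER"),
        bigquery.SchemaField("unit_price", "FLOAT"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("country", "STRING"),
        bigquery.SchemaField("invoice_datetime", "TIMESTAMP"),
    ]

    table = bigquery.Table(TABLE_ID, schema=schema)
    # Daily partitions on the invoice timestamp, clustered by customer.
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY,
        field="invoice_datetime",
    )
    table.clustering_fields = ["customer_id"]

    client.create_table(table, exists_ok=True)

if __name__ == "__main__":
    create_partitioned_clustered_table()
```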
To ensure the continuous availability of up-to-date analytics-ready data, I've configured a dbt deployment job to run every 12 hours. This automated process seamlessly deploys my transformed data to BigQuery, facilitating a consistent and reliable analytics environment.
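The scheduled job itself lives outside this repository (for example, in dbt Cloud or the orchestrator). Under that assumption, the sketch below shows the kind of command sequence such a 12-hour job would execute: build the models, then run the tests defined in `schema.yml`.

```python
import subprocess

# Hypothetical wrapper around the dbt CLI; the 12-hour schedule is configured
# in the deployment platform, not in this script.
def run_dbt_job(project_dir: str = "dbt-Analytics_Engineering(Data Build Tool)") -> None:
    subprocess.run(["dbt", "run", "--project-dir", project_dir], check=True)
    subprocess.run(["dbt", "test", "--project-dir", project_dir], check=True)

if __name__ == "__main__":
    run_dbt_job()
```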
├── (CI.CD)-Continous-Intergration-and-Deployment
│ └── cloudbuild.yaml
This file serves as the blueprint for orchestrating the CI/CD process using Google Cloud Build.
During the CI phase, the workflow defined in `cloudbuild.yaml` is triggered whenever changes are pushed to the repository, kicking off an automated build process that ensures the latest changes are integrated and tested seamlessly.
Following successful CI, the CD pipeline takes over. It automatically deploys the updated artifacts, ensuring the continuous delivery of the data pipeline to the production environment. This automated deployment process eliminates manual interventions, reducing the risk of errors and accelerating the time-to-market for new features and enhancements.
Visualizing insights from data is crucial for understanding trends, patterns, and opportunities. Here's how I designed the dashboard using Looker Studio:
Looker Studio provides a user-friendly interface for building and customizing dashboards, allowing for intuitive visualization of data insights.
└── Looker_studio_Dashboard
└── online_retail_data_Report.pdf
The dashboard provides several key insights into the processed data:
- Top Selling Products: Visualizes the products that sold the most, providing insights into popular items and potential revenue drivers.
- Customer Distribution by Country: Highlights countries with the highest number of customers, aiding in understanding the geographic distribution of your customer base.
- Daily Sales Trend: Illustrates the daily sales trend, identifying peak sales days and potential factors influencing sales fluctuations.
- Monthly Sales Overview: Presents an overview of monthly sales, helping to identify trends and seasonality in sales performance.
View Online Retail Data Report
Feel free to explore the dashboard using the provided link to gain deeper insights into the online retail data.
By leveraging cloud services, IaC, workflow orchestration, and visualization tools, this project optimizes online retail analytics. It enables businesses to derive actionable insights from raw transactional data, facilitating informed decision-making, strategic planning, and enhanced customer experiences. The end-to-end data pipeline and analytics engineering solution demonstrate the power of modern data technologies in driving business growth and innovation.