This project demonstrates a complete ELT pipeline for an e-commerce dataset sourced from Kaggle. The pipeline covers raw data extraction, loading into a cloud data warehouse, and transformation into a star schema model for analytics. It uses modern data engineering tools and practices to deliver a robust, scalable, and maintainable solution.
- Data Warehouse Design: Implements a star schema for optimized querying and analytics, featuring a fact table and multiple dimension tables.
- Modular ELT Workflow:
  - Extract: Data ingestion from the e-commerce dataset.
  - Load: Data stored in Snowflake, a modern cloud-based data warehouse.
  - Transform: Data transformation using dbt to clean, model, and create the star schema.
- Orchestration with Apache Airflow: Automates the pipeline execution in a Dockerized environment (a minimal DAG sketch follows this list).
- Dockerized Setup: Ensures portability and reproducibility across systems.
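To make the workflow concrete, below is a minimal sketch of how the Airflow DAG could chain the three steps. The task names, helper functions, dbt paths, and schedule are illustrative assumptions, not the project's actual code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_raw_data(**context):
    """Placeholder: pull the Kaggle e-commerce CSV files to local storage."""
    ...


def load_to_snowflake(**context):
    """Placeholder: copy the raw files into Snowflake staging tables."""
    ...


with DAG(
    dag_id="ecommerce_elt",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_raw_data)
    load = PythonOperator(task_id="load", python_callable=load_to_snowflake)

    # dbt builds the staging models and the star schema inside Snowflake;
    # the project/profiles paths are assumed mount points in the Airflow container.
    transform = BashOperator(
        task_id="dbt_transform",
        bash_command="dbt run --project-dir /opt/airflow/dbt --profiles-dir /opt/airflow/dbt",
    )

    extract >> load >> transform
```

Keeping the dbt invocation in a BashOperator leaves all transformation logic versioned inside the dbt project, so the DAG only has to express task ordering.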
- Python: Core programming language for the project.
- Apache Airflow: Orchestrates the ELT pipeline.
- dbt (Data Build Tool): Handles SQL-based transformations.
- Snowflake: Cloud-based data warehouse for data storage and analytics (a load sketch follows this list).
- Docker: Containerizes the environment for consistent development and deployment.
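As a rough idea of the load step, the sketch below pushes one raw CSV into Snowflake using snowflake-connector-python with its pandas helpers. The credentials, file path, and table name are placeholders, not the project's configuration.

```python
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Placeholder connection details; real values would come from environment
# variables or an Airflow connection, never hard-coded.
conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="LOAD_WH",
    database="ECOMMERCE",
    schema="RAW",
)

try:
    # Read one of the raw Kaggle CSV extracts (path is illustrative).
    orders = pd.read_csv("data/orders.csv")
    # auto_create_table (available in recent connector versions) creates the
    # raw table on first load instead of requiring DDL up front.
    write_pandas(conn, orders, "RAW_ORDERS", auto_create_table=True)
finally:
    conn.close()
```

An equally valid approach is staging the files with PUT/COPY INTO; write_pandas is used here only to keep the sketch short.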
The star schema organizes the data warehouse for efficient analytical querying.
- `fact_sales`: Contains transactional data, including order details, payments, and shipping information.
- `dim_customers`: Customer-related attributes such as location and demographics.
- `dim_sellers`: Seller-related information, including locations.
- `dim_products`: Product details, including categories and dimensions.
- `dim_geolocation`: Geographical information for customers and sellers.
- `dim_time`: Temporal data for time-series analysis.
- `dim_reviews`: Customer reviews and ratings.
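With the schema in place, analytical questions reduce to joins through the fact table. The sketch below computes monthly revenue by customer state; the key and column names (time_key, customer_key, payment_value, customer_state) are assumptions about the modeled tables.

```python
import snowflake.connector

# Revenue per month and customer state, joining the fact table to two dimensions.
QUERY = """
select
    t.year,
    t.month,
    c.customer_state,
    sum(f.payment_value) as revenue
from fact_sales f
join dim_time t on f.time_key = t.time_key
join dim_customers c on f.customer_key = c.customer_key
group by t.year, t.month, c.customer_state
order by t.year, t.month, revenue desc
"""

conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    warehouse="ANALYTICS_WH",
    database="ECOMMERCE",
    schema="MARTS",
)

try:
    cur = conn.cursor()
    for year, month, state, revenue in cur.execute(QUERY):
        print(year, month, state, revenue)
finally:
    conn.close()
```

Because every dimension hangs off the fact table by a surrogate key, queries like this stay flat and predictable as more dimensions are added.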
This project highlights:
- Expertise in modern ELT workflows.
- Design and implementation of a star schema for data warehouses.
- Integration of dbt, Airflow, and Snowflake in a cohesive pipeline.
- Dockerized development for portability.