
This repository hosts Q Company's data solution, which implements a Lambda Architecture for scalable retail and e-commerce data processing. It combines batch and real-time processing to handle newly arriving product, customer, branch, and sales agent data efficiently, and is built with Hadoop, Spark, and Kafka.


Q Company Data Solution

Table of Contents

  1. Project Overview
  2. Objectives and Goals
  3. Architecture
  4. Technology Stack
  5. Data Flow
  6. Key Features
  7. Maintenance and Improvement
  8. Authors

Project Overview

The Q Company Data Platform project develops an efficient data processing system that combines batch and stream processing in a Lambda Architecture. The platform handles the diverse and dynamic data generated by Q Company's retail operations, including sales transactions, customer interactions, and app logs. By leveraging tools such as Spark, Hive, and Kafka, the project ensures robust data ingestion, transformation, storage, and analysis, ultimately enabling better decision-making and insights for the various business teams.

Objectives and Goals

  1. Data Ingestion and Storage: Efficiently ingest and store raw data files in a data lake, ensuring the data is well-partitioned and easily traceable.
  2. Data Processing and Transformation: Clean and process raw data files using Spark and store the processed data in Hive tables that serve as the Data Warehouse (DWH); a minimal sketch of this step follows the list.
  3. Insight Generation: Generate valuable business insights, including best-selling products, most-redeemed offers, and the lowest-performing cities in online sales.
  4. Daily Reporting: Provide a daily dump of sales agent performance in CSV format for the B2B team.
  5. Real-time Processing: Process dynamic app logs in real-time using Kafka and Spark Streaming, storing the results in HDFS for immediate analysis.
  6. Flexible Reporting: Enable flexible and efficient querying of processed data to support various business needs.
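
The following is a minimal PySpark sketch of objectives 1 and 2, assuming illustrative HDFS paths, column names, and Hive table names; the project's actual schemas and naming may differ:

```python
from pyspark.sql import SparkSession, functions as F

# Hive-enabled session; all paths and table names below are illustrative only.
spark = (
    SparkSession.builder
    .appName("q_company_batch_load")
    .enableHiveSupport()
    .getOrCreate()
)

# Ingest one hour of raw sales transaction files from the data lake's staging area.
raw = (
    spark.read
    .option("header", True)
    .csv("hdfs:///q_company/staging/sales_transactions/date=2024-06-01/hour=10/")
)

# Basic cleansing: normalize column names, drop fully empty rows, keep valid records.
clean = (
    raw.toDF(*[c.strip().lower().replace(" ", "_") for c in raw.columns])
       .dropna(how="all")
       .filter(F.col("transaction_id").isNotNull())
)

# Persist into a partitioned Hive table that acts as the DWH layer.
(
    clean.withColumn("ingest_date", F.current_date())
         .write.mode("append")
         .partitionBy("ingest_date")
         .saveAsTable("dwh.sales_transactions")
)
```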

Architecture

The data flow architecture follows a layered data-lake design on HDFS to ensure a scalable, reliable, and efficient data processing pipeline. The architecture is divided into four layers (an illustrative path layout follows the list):

  1. Staging Layer: Initial storage for raw data files.
  2. Standardized Layer: Interface for users to explore and build use cases using validated raw data.
  3. Conformed Layer: Repository of common entities used across the organization.
  4. Enriched Layer: Hosts the final data products used in business processes.
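
As a rough illustration of how the four layers can map onto the data lake, the snippet below sketches a hypothetical HDFS directory layout; the project's actual paths and partitioning scheme are likely different:

```python
# Hypothetical HDFS layout for the four layers; directory names are illustrative.
LAYERS = {
    "staging":      "hdfs:///q_company/staging",       # raw files as received
    "standardized": "hdfs:///q_company/standardized",  # validated raw data
    "conformed":    "hdfs:///q_company/conformed",     # shared entities (customers, branches, ...)
    "enriched":     "hdfs:///q_company/enriched",      # final data products
}

def layer_path(layer: str, dataset: str, dt: str, hour: str) -> str:
    """Build a partitioned path such as .../staging/sales_transactions/date=2024-06-01/hour=10."""
    return f"{LAYERS[layer]}/{dataset}/date={dt}/hour={hour}"

print(layer_path("staging", "sales_transactions", "2024-06-01", "10"))
```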

Technology Stack

  • Apache Spark: Unified analytics engine for large-scale data processing
  • Apache Hive: Data warehousing solution for Hadoop
  • Apache Kafka: Distributed event streaming platform
  • Hadoop Distributed File System (HDFS): Distributed storage for large datasets
  • PostgreSQL: Relational database backing the Hive metastore (see the wiring sketch after this list)
  • Python: Scripting language for automation and custom ETL processes
  • Docker: Containerization platform
  • Metabase: Open-source business intelligence tool for creating dashboards and visual reports
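
The sketch below shows, under assumed host names and paths, how a Spark client ties several of these components together (Hive metastore backed by PostgreSQL, warehouse files on HDFS); it illustrates the wiring rather than the project's actual configuration:

```python
from pyspark.sql import SparkSession

# Illustrative wiring of the stack from a Spark client: the metastore host and
# warehouse path are placeholders for whatever the Docker setup exposes.
spark = (
    SparkSession.builder
    .appName("q_company_stack_check")
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")    # Hive metastore (PostgreSQL-backed)
    .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")  # Hive table files on HDFS
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()  # quick sanity check that the metastore is reachable
```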

Data Flow

  1. Data Extraction: Connect to the FTP server, download the most recent files, then process and merge the data.
  2. Data Cleansing: Standardize data by applying transformations like renaming columns, removing blank columns, and validating data.
  3. Data Transformation: Structure and enrich data with additional dimensions for detailed analysis.
  4. Data Serving: Model data into various formats (Spark DataFrame queries, Spark SQL, Hive tables) for analysis and reporting.
  5. Streaming: Handle real-time data using Kafka for ingestion and Spark Streaming for processing (a streaming sketch follows this list).
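
A minimal Spark Structured Streaming sketch of the streaming step is shown below; the topic name, broker address, log schema, and output paths are assumptions, and the Kafka connector package (spark-sql-kafka) must be available on the Spark classpath:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("q_company_app_logs_stream").getOrCreate()

# Assumed schema of an app-log event; adjust to the actual log format.
log_schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("page", StringType()),
])

# Ingest app-log events from Kafka.
raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "app_logs")
    .load()
)

# Parse the JSON payload and land the result on HDFS in micro-batches.
parsed = (
    raw_stream
    .select(F.from_json(F.col("value").cast("string"), log_schema).alias("log"))
    .select("log.*")
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///q_company/streaming/app_logs")
    .option("checkpointLocation", "hdfs:///q_company/streaming/checkpoints/app_logs")
    .trigger(processingTime="1 minute")
    .start()
)
```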

Key Features

  • Batch processing of hourly data files (branches, sales agents, sales transactions)
  • Real-time processing of app logs
  • Data cleansing and transformation pipeline
  • Business insights generation (e.g., best-selling products, most-redeemed offers); an example query follows this list
  • Automated daily reporting for B2B team
  • Flexible querying for ad-hoc analysis
  • Scalable and fault-tolerant architecture
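
As an example of insight generation, a "best-selling products" query might look like the following Spark SQL sketch; the table and column names are illustrative, not the project's actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Top 10 best-selling products over the DWH table (illustrative names).
spark.sql("""
    SELECT product_name,
           SUM(quantity) AS total_units_sold
    FROM dwh.sales_transactions
    GROUP BY product_name
    ORDER BY total_units_sold DESC
    LIMIT 10
""").show()
```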

Maintenance and Improvement

Streaming Layer

  • Addressing the small-files problem in HDFS (a compaction sketch follows this list)
  • Optimizing Spark jobs through partitioning and resource scaling
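
One common way to mitigate the small-files problem is to periodically compact the streaming output into fewer, larger files; the sketch below illustrates the idea with assumed paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact_app_logs").getOrCreate()

# Read the many small Parquet files produced by the streaming job (paths are illustrative).
logs = spark.read.parquet("hdfs:///q_company/streaming/app_logs")

# Rewrite the data as a handful of larger files instead of thousands of tiny ones.
(
    logs.coalesce(8)
        .write.mode("overwrite")
        .parquet("hdfs:///q_company/streaming/app_logs_compacted")
)
```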

Batch Layer

  • Automating batch processing with Apache Airflow (planned; a hypothetical DAG sketch follows this list)
  • Developing a comprehensive metadata layer for data management
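
Since the Airflow automation is still planned, the DAG below is purely a hypothetical sketch of what an hourly batch pipeline could look like; the DAG id, schedule, and job scripts are placeholders, not existing project code:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical hourly pipeline: pull files from FTP, then run the Spark batch job.
with DAG(
    dag_id="q_company_hourly_batch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract_from_ftp",
        bash_command="python /opt/jobs/extract_ftp.py",  # placeholder script
    )

    transform_and_load = BashOperator(
        task_id="spark_transform_load",
        bash_command="spark-submit /opt/jobs/batch_transform.py",  # placeholder script
    )

    extract >> transform_and_load
```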

Authors

  • Emad El-Din Adel
  • Yousef Saber Abdul-Kareem
  • Ahmed Hatem Ghazy