
This repository hosts Q Company's data solution, which implements a Lambda Architecture for scalable retail and e-commerce data processing. It combines batch and real-time processing to handle newly arriving product, customer, branch, and sales agent data efficiently, and is built with Hadoop, Spark, and Kafka.


Q Company Data Solution

Table of Contents

  1. Project Overview
  2. Objectives and Goals
  3. Architecture
  4. Technology Stack
  5. Data Flow
  6. Key Features
  7. Maintenance and Improvement
  8. Authors

Project Overview

The Q Company Data Platform project develops an efficient data processing system that combines batch and stream processing in a Lambda Architecture. The platform handles the diverse and dynamic data generated by Q Company's retail operations, including sales transactions, customer interactions, and app logs. By leveraging tools such as Spark, Hive, and Kafka, the project ensures robust data ingestion, transformation, storage, and analysis, ultimately enabling better decision-making and insights for the various business teams.

Objectives and Goals

  1. Data Ingestion and Storage: Efficiently ingest and store raw data files in a data lake, ensuring the data is well-partitioned and easily traceable.
  2. Data Processing and Transformation: Clean and process raw data files using Spark and store the processed data in Hive tables that serve as the Data Warehouse (DWH); a minimal sketch of this step follows the list.
  3. Insight Generation: Generate valuable business insights, including best-selling products, most-redeemed offers, and the lowest-performing cities in online sales.
  4. Daily Reporting: Provide a daily dump of sales agent performance in CSV format for the B2B team.
  5. Real-time Processing: Process dynamic app logs in real-time using Kafka and Spark Streaming, storing the results in HDFS for immediate analysis.
  6. Flexible Reporting: Enable flexible and efficient querying of processed data to support various business needs.
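
The following is a minimal PySpark sketch of objectives 1 and 2, assuming illustrative HDFS paths, column names, and Hive table names; the project's actual schemas and naming may differ:

```python
from pyspark.sql import SparkSession, functions as F

# Hive-enabled session; all paths and table names below are illustrative only.
spark = (
    SparkSession.builder
    .appName("q_company_batch_load")
    .enableHiveSupport()
    .getOrCreate()
)

# Ingest one hour of raw sales transaction files from the data lake's staging area.
raw = (
    spark.read
    .option("header", True)
    .csv("hdfs:///q_company/staging/sales_transactions/date=2024-06-01/hour=10/")
)

# Basic cleansing: normalize column names, drop fully empty rows, keep valid records.
clean = (
    raw.toDF(*[c.strip().lower().replace(" ", "_") for c in raw.columns])
       .dropna(how="all")
       .filter(F.col("transaction_id").isNotNull())
)

# Persist into a partitioned Hive table that acts as the DWH layer.
(
    clean.withColumn("ingest_date", F.current_date())
         .write.mode("append")
         .partitionBy("ingest_date")
         .saveAsTable("dwh.sales_transactions")
)
```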

Architecture

The data flow architecture follows a layered data-lake design on HDFS to ensure a scalable, reliable, and efficient data processing pipeline. The architecture is divided into four layers (an illustrative path layout follows the list):

  1. Staging Layer: Initial storage for raw data files.
  2. Standardized Layer: Interface for users to explore and build use cases using validated raw data.
  3. Conformed Layer: Repository of common entities used across the organization.
  4. Enriched Layer: Hosts the final data products used in business processes.
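
As a rough illustration of how the four layers can map onto the data lake, the snippet below sketches a hypothetical HDFS directory layout; the project's actual paths and partitioning scheme are likely different:

```python
# Hypothetical HDFS layout for the four layers; directory names are illustrative.
LAYERS = {
    "staging":      "hdfs:///q_company/staging",       # raw files as received
    "standardized": "hdfs:///q_company/standardized",  # validated raw data
    "conformed":    "hdfs:///q_company/conformed",     # shared entities (customers, branches, ...)
    "enriched":     "hdfs:///q_company/enriched",      # final data products
}

def layer_path(layer: str, dataset: str, dt: str, hour: str) -> str:
    """Build a partitioned path such as .../staging/sales_transactions/date=2024-06-01/hour=10."""
    return f"{LAYERS[layer]}/{dataset}/date={dt}/hour={hour}"

print(layer_path("staging", "sales_transactions", "2024-06-01", "10"))
```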

Technology Stack

  • Apache Spark: Unified analytics engine for large-scale data processing
  • Apache Hive: Data warehousing solution for Hadoop
  • Apache Kafka: Distributed event streaming platform
  • Hadoop Distributed File System (HDFS): Distributed storage for large datasets
  • PostgreSQL: Relational database backing the Hive metastore (see the wiring sketch after this list)
  • Python: Scripting language for automation and custom ETL processes
  • Docker: Containerization platform
  • Metabase: Open-source business intelligence tool for creating dashboards and visual reports
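
The sketch below shows, under assumed host names and paths, how a Spark client ties several of these components together (Hive metastore backed by PostgreSQL, warehouse files on HDFS); it illustrates the wiring rather than the project's actual configuration:

```python
from pyspark.sql import SparkSession

# Illustrative wiring of the stack from a Spark client: the metastore host and
# warehouse path are placeholders for whatever the Docker setup exposes.
spark = (
    SparkSession.builder
    .appName("q_company_stack_check")
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")    # Hive metastore (PostgreSQL-backed)
    .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")  # Hive table files on HDFS
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()  # quick sanity check that the metastore is reachable
```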

Data Flow

  1. Data Extraction: Connect to the FTP server, download the most recent files, then process and merge the data.
  2. Data Cleansing: Standardize data by applying transformations like renaming columns, removing blank columns, and validating data.
  3. Data Transformation: Structure and enrich data with additional dimensions for detailed analysis.
  4. Data Serving: Model data into various formats (Spark DataFrame queries, Spark SQL, Hive tables) for analysis and reporting.
  5. Streaming: Handle real-time data using Kafka for ingestion and Spark Streaming for processing (a streaming sketch follows this list).
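
A minimal Spark Structured Streaming sketch of the streaming step is shown below; the topic name, broker address, log schema, and output paths are assumptions, and the Kafka connector package (spark-sql-kafka) must be available on the Spark classpath:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("q_company_app_logs_stream").getOrCreate()

# Assumed schema of an app-log event; adjust to the actual log format.
log_schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("page", StringType()),
])

# Ingest app-log events from Kafka.
raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "app_logs")
    .load()
)

# Parse the JSON payload and land the result on HDFS in micro-batches.
parsed = (
    raw_stream
    .select(F.from_json(F.col("value").cast("string"), log_schema).alias("log"))
    .select("log.*")
)

query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///q_company/streaming/app_logs")
    .option("checkpointLocation", "hdfs:///q_company/streaming/checkpoints/app_logs")
    .trigger(processingTime="1 minute")
    .start()
)
```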

Key Features

  • Batch processing of hourly data files (branches, sales agents, sales transactions)
  • Real-time processing of app logs
  • Data cleansing and transformation pipeline
  • Business insights generation (e.g., best-selling products, most-redeemed offers); an example query follows this list
  • Automated daily reporting for B2B team
  • Flexible querying for ad-hoc analysis
  • Scalable and fault-tolerant architecture
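
As an example of insight generation, a "best-selling products" query might look like the following Spark SQL sketch; the table and column names are illustrative, not the project's actual schema:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Top 10 best-selling products over the DWH table (illustrative names).
spark.sql("""
    SELECT product_name,
           SUM(quantity) AS total_units_sold
    FROM dwh.sales_transactions
    GROUP BY product_name
    ORDER BY total_units_sold DESC
    LIMIT 10
""").show()
```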

Maintenance and Improvement

Streaming Layer

  • Addressing the small-files problem in HDFS (a compaction sketch follows this list)
  • Optimizing Spark jobs through partitioning and resource scaling
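
One common way to mitigate the small-files problem is to periodically compact the streaming output into fewer, larger files; the sketch below illustrates the idea with assumed paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact_app_logs").getOrCreate()

# Read the many small Parquet files produced by the streaming job (paths are illustrative).
logs = spark.read.parquet("hdfs:///q_company/streaming/app_logs")

# Rewrite the data as a handful of larger files instead of thousands of tiny ones.
(
    logs.coalesce(8)
        .write.mode("overwrite")
        .parquet("hdfs:///q_company/streaming/app_logs_compacted")
)
```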

Batch Layer

  • Automating batch processing with Apache Airflow (planned; a hypothetical DAG sketch follows this list)
  • Developing a comprehensive metadata layer for data management
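
Since the Airflow automation is still planned, the DAG below is purely a hypothetical sketch of what an hourly batch pipeline could look like; the DAG id, schedule, and job scripts are placeholders, not existing project code:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical hourly pipeline: pull files from FTP, then run the Spark batch job.
with DAG(
    dag_id="q_company_hourly_batch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:

    extract = BashOperator(
        task_id="extract_from_ftp",
        bash_command="python /opt/jobs/extract_ftp.py",  # placeholder script
    )

    transform_and_load = BashOperator(
        task_id="spark_transform_load",
        bash_command="spark-submit /opt/jobs/batch_transform.py",  # placeholder script
    )

    extract >> transform_and_load
```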

Authors

  • Emad El-Din Adel
  • Yousef Saber Abdul-Kareem
  • Ahmed Hatem Ghazy