/serverless_etl_pipeline

Developed an ETL pipeline for real-time ingestion of stock market data from the stock-market-data-manage.onrender.com API. Engineered the system to store data in Parquet format for optimized query processing and incorporated data quality checks to ensure accuracy prior to visualization.

Primary LanguagePython

Stock Data Ingestion and Visualization Pipeline

Project Overview

This project establishes a robust data pipeline using AWS services to ingest, transform, and visualize stock market data. By leveraging serverless and managed services, the architecture ensures scalability, maintainability, and efficiency.

Architecture

The pipeline is designed around several core AWS services:

  1. AWS Lambda: Interacts with an external API to fetch real-time stock market data.
  2. Amazon Kinesis Data Firehose: Streams data efficiently into AWS S3 for durable storage.
  3. AWS S3: Acts as a central repository to store the ingested stock market data in raw format.
  4. AWS Glue: Manages data cataloging and runs ETL (Extract, Transform, Load) jobs to transform raw data into a query-optimized format.
  5. AWS Athena: Used for running SQL queries against the data stored in S3, leveraging the serverless capabilities to handle large datasets.
  6. Grafana: Visualizes the transformed data to provide insights into stock market trends, connected to AWS Athena as the data source.

Data Flow

  1. Data Ingestion: The data ingestion begins with an external API call to https://stock-market-data-manage.onrender.com/ which provides detailed stock market data. The data includes various stock attributes such as open, high, low, close, adjusted close, and volume.
  2. Lambda Function: An AWS Lambda function is triggered periodically by an EventBridge schedule to call the external API and retrieve the latest data.
  3. Kinesis Data Firehose: The Lambda function pushes the data to Amazon Kinesis Data Firehose, which then streams this data to an S3 bucket.
  4. Data Transformation: AWS Glue is employed to catalog the data and run transformation jobs converting the raw data into a more analytics-friendly format (Parquet) that is optimized for quick retrieval and analysis.
  5. Visualization: Finally, the processed data is visualized using Grafana, providing dynamic and real-time insights into market trends and performance.

Project Goals

The primary goal of this project is to demonstrate the capabilities of AWS services in building an effective ETL pipeline, emphasizing the architectural choices and integrations necessary for real-time data processing and visualization in the cloud.

Tools and Technologies Used

AWS Lambda, AWS Kinesis, AWS S3, AWS Glue, AWS Athena, Grafana, and Python.

Getting Started with the Stock Data Ingestion and Visualization Pipeline

PART 1: Data Ingestion

Objective: Extract data from external sources and ingest it into AWS.

Steps:
  1. AWS S3 and Athena Setup:
    • Amazon S3 buckets: Create S3 buckets to store raw data. This data can be accessed by various AWS services throughout the pipeline.
    • AWS Athena: Use Athena's serverless query editor to analyze data directly in S3, providing a powerful tool for exploring and querying large datasets without server management.
  2. Data Ingestion Using Lambda:
    • AWS Lambda: Implement Python scripts in Lambda functions to fetch data from https://stock-market-data-manage.onrender.com/. Lambda allows running code without provisioning servers, handling the scaling automatically.
    • Automating Lambda Execution: EventBridge Triggers: Set up AWS EventBridge to trigger Lambda functions at specified intervals, ensuring regular data updates without manual intervention.

    • Data Storage: The data fetched is then pushed to the configured S3 bucket.

PART 2: Data Transformation

Objective: Organize and optimize the ingested data for analysis and visualization by transforming it into an efficient format and ensuring data integrity before visualization.

Steps:
  1. Batching Data with AWS Firehose:
    • AWS Firehose: Set up Firehose to collect and batch incoming data, streamlining the process of loading large volumes of data into S3. Firehose is configured to automatically partition the incoming data before it is stored, making future queries more efficient.
  2. Table Creation with Glue Crawler:
    • AWS Glue Crawler: Employ the Glue Crawler to automatically scan the batched data stored in S3 and create metadata tables in the AWS Glue Data Catalog. This step is essential for organizing data into a searchable and manageable format.
  3. Data Preparation with Glue Jobs:
    • Parquet Conversion: Configure Glue jobs to transform the batched data into the Parquet format. Parquet is chosen for its efficiency in storing columnar data, which significantly enhances both data compression and query performance.
    • Workflow Management: Implement a series of Glue jobs orchestrated within a Glue workflow to systematically process the data:
      • Data Crawling: Crawl new data to update the Glue Data Catalog with the latest datasets.
      • Data Cleaning: Remove outdated or redundant data from S3 to maintain data hygiene.
      • Table Optimization: Create optimized data tables in Parquet format that are structured for quick access and analysis.
      • Data Quality Assurance: Perform quality checks on the transformed data to ensure accuracy and consistency before it moves to production storage.
      • Data Finalization: Store the fully processed and verified data in a designated 'prod_s3_bucket' that serves as the final repository for data ready for analysis.
  4. Workflow Orchestration:
    • AWS Glue Workflow: Before the data visualization stage, orchestrate a comprehensive workflow using AWS Glue. This workflow manages the sequence of tasks from data ingestion to storage, ensuring that data flows seamlessly through each phase of the ETL process. It includes triggers for each job, error handling mechanisms, and dependency resolution to ensure that data is processed efficiently and correctly, ready for the next stage.
    • Monitoring and Logs: Utilize AWS CloudWatch to monitor the execution of the workflow and log all activities. This enables troubleshooting and optimization of the ETL process, ensuring high reliability and performance.

  5. Data Visualization Preparation: Integration with Grafana: Prepare the transformed and curated data for visualization by setting up integrations between the production data in 'prod_s3_bucket' and Grafana through AWS Athena. This setup allows for dynamic querying and visualization of the data, enabling users to generate real-time insights from the processed stock market data.