Real-Time-Streaming

End-to-End Data Engineering using PySpark, Kafka, Cassandra, and Docker


🤖 Data Project

Learning Data Engineering.

Overview • Technologies and Tools Used • Project Structure • Scripts Overview • Getting Started • What I Learned

🚧 Data Engineering Project 🚀 Finished 🚧


Overview

This project demonstrates a data processing pipeline using Kafka, PySpark, Docker, Cassandra, and OpenAI. The goal was to create a system for real-time data streaming and processing, integrating various technologies to build a scalable and efficient architecture.

Technologies and Tools Used

  • Kafka: A distributed streaming platform used for building real-time data pipelines and streaming applications.
  • PySpark: The Python API for Apache Spark, used for large-scale data processing and analytics.
  • Docker: A platform for building and running applications in containers, ensuring consistent environments across development, testing, and production.
  • Cassandra: A distributed NoSQL database designed for handling large amounts of data across many commodity servers.
  • OpenAI: Used to integrate advanced language models into the pipeline (see the sketch after this list).
  • Python: The primary programming language used for scripting and application logic.
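
The OpenAI piece typically sits on the producer side, turning unstructured text into structured records before they are published to Kafka. The snippet below is a minimal sketch of such a call using the openai Python client; the model name, prompt, and extracted fields are illustrative assumptions, not taken from this repository.

    # Hedged sketch: use an OpenAI model to turn raw listing text into
    # structured JSON before publishing it to Kafka. The model name, prompt,
    # and field names are illustrative assumptions.
    import json

    from openai import OpenAI  # expects OPENAI_API_KEY in the environment

    client = OpenAI()

    def extract_fields(raw_text: str) -> dict:
        """Ask the model to return the listing as a flat JSON object."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model
            messages=[
                {"role": "system",
                 "content": "Extract the address and price from the listing. Reply with JSON only."},
                {"role": "user", "content": raw_text},
            ],
        )
        return json.loads(response.choices[0].message.content)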

Project Structure

├── jobs/requirements.txt
├── jobs/spark-consumer.py
├── .env.example
├── constants.py
├── docker-compose.yml
└── main.py

Scripts Overview

  • jobs/requirements.txt: Lists the Python dependencies required for the Spark consumer job.
  • jobs/spark-consumer.py: Contains the code for consuming data from Kafka and processing it with PySpark (a hedged sketch of this pattern follows the list).
  • .env.example: An example environment variable file to configure your local environment.
  • constants.py: Defines constants used throughout the project.
  • docker-compose.yml: Defines and runs multi-container Docker applications, configuring services for Kafka, Cassandra, and other components.
  • main.py: The main script to initialize and run the application.
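
The consumer is the core of the pipeline. Below is a minimal sketch of the pattern a job like jobs/spark-consumer.py follows: read a Kafka topic with Spark Structured Streaming, parse the JSON payload, and append each micro-batch to Cassandra. The topic, keyspace, table, and schema fields are illustrative assumptions, not copied from the repository; the Kafka and Cassandra connector packages come from the spark-submit --packages flag shown in the Getting Started section.

    # Hedged sketch of a Kafka -> Spark -> Cassandra consumer.
    # Topic, keyspace, table, and schema fields are illustrative assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    spark = (SparkSession.builder
             .appName("spark-consumer")
             .config("spark.cassandra.connection.host", "localhost")  # assumed host
             .getOrCreate())

    # Schema of the incoming JSON messages (illustrative fields).
    schema = StructType([
        StructField("id", StringType()),
        StructField("address", StringType()),
        StructField("price", DoubleType()),
    ])

    # Read the raw Kafka stream; "properties" is an assumed topic name.
    kafka_df = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "properties")
                .option("startingOffsets", "earliest")
                .load())

    # Kafka values arrive as bytes: cast to string, then parse the JSON payload.
    parsed_df = (kafka_df
                 .selectExpr("CAST(value AS STRING) AS value")
                 .select(from_json(col("value"), schema).alias("data"))
                 .select("data.*"))

    # Append each micro-batch to Cassandra (keyspace/table names are assumptions).
    query = (parsed_df.writeStream
             .foreachBatch(lambda batch_df, _: batch_df.write
                           .format("org.apache.spark.sql.cassandra")
                           .options(keyspace="property_streams", table="properties")
                           .mode("append")
                           .save())
             .option("checkpointLocation", "/tmp/checkpoint")
             .start())

    query.awaitTermination()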

Getting Started

To get started with this project, follow these steps:

  1. Clone the Repository:

    git clone <repository_url>
    cd <repository_directory>
    
  2. Set Up the Environment:

    # Install Python version
    pyenv install 3.10.12
    
    # Create a virtual environment
    pyenv virtualenv 3.10.12 <env_name>
    
    # Activate the environment
    pyenv activate <env_name>
    
    # Install dependencies
    pip install -r jobs/requirements.txt
  3. Run the Docker Containers:

    docker compose up -d
  4. Set Up the Database and Run the Spark Consumer:

    docker exec -it realestatedataengineering-spark-master-1 spark-submit \
      --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
      spark-consumer.py
  5. Execute the Main Script:

    python main.py
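
The repository's main.py is not reproduced here; the sketch below only illustrates the general pattern such a producer follows, using the kafka-python client (the project may use a different client). The topic name, broker address, and payload fields are illustrative assumptions.

    # Hedged sketch of a producer along the lines of main.py.
    # Topic name, broker address, and payload fields are illustrative assumptions.
    import json
    import time

    from kafka import KafkaProducer  # kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda payload: json.dumps(payload).encode("utf-8"),
    )

    # Publish a few example records to an assumed "properties" topic.
    for i in range(3):
        record = {"id": str(i), "address": f"{i} Example Street", "price": 100000.0 + i}
        producer.send("properties", value=record)
        time.sleep(1)

    producer.flush()  # make sure everything is delivered before the script exits

Once records are flowing, you can spot-check that they reach Cassandra by opening cqlsh inside the Cassandra container (docker exec -it <cassandra_container> cqlsh) and selecting from the keyspace and table the consumer writes to.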
    

What I Learned

  • Kafka Integration: Gained experience in using Kafka for real-time data streaming and message brokering.
  • PySpark: Developed skills in large-scale data processing and analytics using PySpark.
  • Docker: Learned to containerize applications and manage multi-container setups with Docker Compose.
  • Cassandra: Worked with Cassandra for scalable and distributed database solutions.
  • OpenAI API: Integrated OpenAI's language models for advanced text processing and analysis.

Author


Source: https://www.youtube.com/@CodeWithYu/videos