# Learning Data Engineering
Overview • Technologies and Tools Used • Project Structure • Getting Started • Running the Pipeline • What I Learned
## Overview

This project demonstrates a data processing pipeline built with Kafka, PySpark, Docker, Cassandra, and OpenAI. The goal was to create a system for real-time data streaming and processing: data is streamed through Kafka, processed with PySpark, and stored in Cassandra, with the supporting services orchestrated through Docker Compose. Integrating these technologies yields a scalable and efficient architecture.
## Technologies and Tools Used

- Kafka: A distributed streaming platform used for building real-time data pipelines and streaming applications.
- PySpark: The Python API for Apache Spark, used for large-scale data processing and analytics.
- Docker: A platform for building and running containerized applications, ensuring consistent environments across development, testing, and production.
- Cassandra: A distributed NoSQL database designed for handling large amounts of data across many commodity servers.
- OpenAI: Used to integrate advanced language models into the pipeline (a short sketch follows this list).
- Python: The primary programming language used for scripting and application logic.
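As a rough illustration of the OpenAI piece, the sketch below shows how a language model call might be used to turn raw listing text into structured fields. This is a minimal sketch, not the project's actual code: it assumes the `openai` Python package (v1+) and an `OPENAI_API_KEY` environment variable, and the model name and prompt are placeholders.

```python
import json
from openai import OpenAI

# Minimal sketch (not the project's actual code): ask a model to extract
# structured fields from unstructured listing text.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

raw_listing = "Bright 2-bed flat near the station, offers over 250,000"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name, not taken from this repo
    messages=[{
        "role": "user",
        "content": (
            "Return a JSON object with keys 'title' and 'price' "
            f"for this listing: {raw_listing}"
        ),
    }],
)

# The model's reply is expected (though not guaranteed) to be a JSON string.
record = json.loads(response.choices[0].message.content)
print(record)
```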
## Project Structure

```
├── jobs/requirements.txt
├── jobs/spark-consumer.py
├── .env.example
├── constants.py
├── docker-compose.yml
└── main.py
```
- `jobs/requirements.txt`: Lists the Python dependencies required for the Spark consumer job.
- `jobs/spark-consumer.py`: Contains the code for consuming data from Kafka and processing it using PySpark (a minimal sketch of such a consumer appears after this list).
- `.env.example`: An example environment variable file to configure your local environment.
- `constants.py`: Defines constants used throughout the project.
- `docker-compose.yml`: Defines and runs the multi-container Docker application, configuring services for Kafka, Cassandra, and other components.
- `main.py`: The main script to initialize and run the application.
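The heart of the project is `jobs/spark-consumer.py`. The sketch below is a guess at its general shape rather than the actual implementation: a Structured Streaming job that reads JSON messages from a Kafka topic and writes them to a Cassandra table. The topic, keyspace, table, schema, and host names are all assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Hypothetical consumer outline: Kafka -> PySpark -> Cassandra.
spark = (
    SparkSession.builder
    .appName("spark-consumer")
    .config("spark.cassandra.connection.host", "localhost")  # assumed host
    .getOrCreate()
)

# Assumed message schema; the real project likely has more fields.
schema = StructType([
    StructField("address", StringType()),
    StructField("price", StringType()),
])

# Read the raw Kafka stream and parse the JSON payload.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "properties")                    # assumed topic
    .option("startingOffsets", "earliest")
    .load()
)

parsed = (
    raw.selectExpr("CAST(value AS STRING) AS value")
    .select(from_json(col("value"), schema).alias("data"))
    .select("data.*")
)

# Stream the parsed rows into Cassandra via the Spark Cassandra connector.
query = (
    parsed.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "property_streams")   # assumed keyspace
    .option("table", "properties")            # assumed table
    .option("checkpointLocation", "/tmp/spark-checkpoint")
    .start()
)

query.awaitTermination()
```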
## Getting Started

To get started with this project, follow these steps:
- **Clone the repository:**

  ```bash
  git clone <repository_url>
  cd <repository_directory>
  ```
- **Set up the environment:**

  ```bash
  # Install the Python version
  pyenv install 3.10.12

  # Create a virtual environment
  pyenv virtualenv 3.10.12 <env_name>

  # Activate the environment
  pyenv activate <env_name>

  # Install dependencies
  pip install -r jobs/requirements.txt
  ```
- **Run the Docker containers:**

  ```bash
  docker compose up -d
  ```
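Before moving on, it can help to confirm that the containers are reachable from the host. This optional check is not part of the project; it assumes the `cassandra-driver` and `kafka-python` packages are installed and that the services expose the default ports.

```python
from cassandra.cluster import Cluster
from kafka import KafkaProducer

# Check Cassandra: connect and read the server version.
cluster = Cluster(["localhost"])  # assumed host, default port 9042
session = cluster.connect()
row = session.execute("SELECT release_version FROM system.local").one()
print("Cassandra version:", row.release_version)
cluster.shutdown()

# Check Kafka: confirm the bootstrap broker is reachable.
producer = KafkaProducer(bootstrap_servers="localhost:9092")  # assumed broker
print("Kafka reachable:", producer.bootstrap_connected())
producer.close()
```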
## Running the Pipeline

- **Set up the database and start the Spark consumer:**

  ```bash
  docker exec -it realestatedataengineering-spark-master-1 \
    spark-submit \
    --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
    spark-consumer.py
  ```
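"Set up the database" here presumably means the consumer creates its Cassandra keyspace and table on startup. The sketch below shows one way that bootstrap could look using the `cassandra-driver` package; the keyspace, table, and columns are assumptions matching the consumer sketch above, not the project's actual schema.

```python
from cassandra.cluster import Cluster

# Hypothetical schema bootstrap; names match the consumer sketch above.
cluster = Cluster(["localhost"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS property_streams
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS property_streams.properties (
        address TEXT PRIMARY KEY,
        price   TEXT
    );
""")

cluster.shutdown()
```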
- **Execute the main script:**

  ```bash
  python main.py
  ```
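`main.py` is the producer side of the pipeline. As a sketch of what it might do (not the actual code), the snippet below publishes JSON records to the Kafka topic the consumer reads from; the topic name and record shape are assumptions, and in practice the record could come from an OpenAI extraction step like the one sketched earlier.

```python
import json
from kafka import KafkaProducer

# Hypothetical producer outline for main.py: publish structured records to Kafka.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# In the real project this record would come from the application logic,
# e.g. an OpenAI-assisted extraction step as sketched earlier.
record = {"address": "221B Baker Street", "price": "650000"}

producer.send("properties", record)  # assumed topic name
producer.flush()
producer.close()
```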
## What I Learned

- Kafka Integration: Gained experience in using Kafka for real-time data streaming and message brokering.
- PySpark: Developed skills in large-scale data processing and analytics using PySpark.
- Docker: Learned to containerize applications and manage multi-container setups with Docker Compose.
- Cassandra: Worked with Cassandra for scalable and distributed database solutions.
- OpenAI API: Integrated OpenAI's language models for advanced text processing and analysis.