/realtime-sentiment-stream

This project is a guide to building an end-to-end data engineering pipeline with TCP/IP sockets, Apache Spark, an OpenAI LLM, and Kafka. It covers each stage: data acquisition, processing, sentiment analysis with ChatGPT, and production to a Kafka topic.

Primary language: Python

Build and start the Spark cluster with Docker Compose

docker compose up --build -d

Run the streaming socket inside the spark-master container

docker exec -it spark-master /bin/bash
/opt/bitnami/spark# python jobs/streaming_socket.py 
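
For orientation, here is a minimal sketch of what a streaming socket of this kind could look like. It is not the contents of jobs/streaming_socket.py; the function name serve_records, the ready callback, and the sample field names are assumptions made for illustration. It accepts one TCP client and sends each record as a newline-terminated UTF-8 JSON line, which is the line-oriented text format that Spark's socket source reads.

```python
import json
import socket
import threading


def serve_records(records, host="127.0.0.1", port=0, ready=None):
    """Accept one client and send each record as a newline-terminated JSON line.

    `records` is any iterable of JSON-serializable dicts (hypothetical input).
    `ready`, if given, is called with the bound (host, port) before accepting,
    which is useful when port=0 lets the OS pick a free port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        if ready is not None:
            ready(server.getsockname())
        conn, _addr = server.accept()
        with conn:
            for record in records:
                # One JSON object per line, so the reader can split on newlines.
                conn.sendall((json.dumps(record) + "\n").encode("utf-8"))


if __name__ == "__main__":
    # Illustrative records only; a real run would read them from the dataset.
    sample = [{"review_id": "r1", "text": "Great food"}]
    serve_records(sample, host="0.0.0.0", port=9999)
```

The server closes the connection once the records are exhausted, so a client sees end-of-stream after the last line. The port 9999 in the usage example is an arbitrary choice, not a value taken from this project.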

Submit the Spark streaming job to the Spark master

docker exec -it spark-master spark-submit \
--master spark://spark-master:7077 \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 \
jobs/spark_streaming.py
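
As a rough illustration of the job's shape, the sketch below reads lines from a TCP socket with Structured Streaming and writes them to a Kafka topic. It is not the contents of jobs/spark_streaming.py: the function names, host names, port, topic, and checkpoint path are all hypothetical, and the sentiment analysis step is left out entirely. It assumes PySpark is installed and that the Kafka connector from the --packages flag above is on the classpath.

```python
def kafka_sink_options(bootstrap_servers, topic, checkpoint_dir):
    """Collect the options the Kafka sink needs (all values are caller-supplied)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        # Structured Streaming requires a checkpoint location for the Kafka sink.
        "checkpointLocation": checkpoint_dir,
    }


def run(socket_host, socket_port, sink_options):
    """Stream text lines from a socket into Kafka, unchanged."""
    # Imported lazily so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SocketToKafkaSketch").getOrCreate()

    # The socket source yields a single string column named "value".
    lines = (
        spark.readStream.format("socket")
        .option("host", socket_host)
        .option("port", socket_port)
        .load()
    )

    # The Kafka sink expects a "value" column of string or binary type.
    query = (
        lines.selectExpr("CAST(value AS STRING) AS value")
        .writeStream.format("kafka")
        .options(**sink_options)
        .start()
    )
    query.awaitTermination()


# Example call with hypothetical names (not taken from this project):
# run("spark-master", 9999,
#     kafka_sink_options("broker:9092", "reviews", "/tmp/checkpoint"))
```

A real job would parse each JSON line and add the sentiment result before the write; this sketch only shows the source and sink wiring.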

Download the Yelp dataset and extract its JSON files into the datasets directory

Yelp.com/dataset
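
To confirm the extracted files are readable, a short sketch like the following can print the first few records. It assumes the files are newline-delimited JSON (one object per line); the helper name iter_records and the file path in the usage example are assumptions, so adjust them to match what you extracted.

```python
import json


def iter_records(path, limit=None):
    """Yield one dict per non-empty line of a newline-delimited JSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        count = 0
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if limit is not None and count >= limit:
                break
            yield json.loads(line)
            count += 1


if __name__ == "__main__":
    # Assumed path; change it to one of the files in your datasets directory.
    for record in iter_records("datasets/yelp_academic_dataset_review.json", limit=3):
        print(record)
```

Reading lazily, one line at a time, avoids loading a large file into memory at once.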