Objective:
- The goal was to develop an application that performs real-time sentiment analysis on Reddit posts, providing immediate insights into public opinion on various topics.
Pipeline Implementation:
Data Ingestion and Environment Setup:
- Configured a local big data environment comprising HDFS, Kafka, Zookeeper, and Hive, supplemented by cloud resources to optimize performance and scalability.
- Installed Python and PySpark, integrating them with Kafka to facilitate data streaming from Reddit.
API Integration and Data Streaming:
- Utilized PRAW (Python Reddit API Wrapper) to stream live Reddit data based on specific keywords.
- Implemented a Python script to filter and preprocess this data, ensuring cleanliness by removing URLs and non-essential content.
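A minimal sketch of the preprocessing step described above. The cleaning rules (strip URLs, drop non-essential symbols, collapse whitespace) follow the README's description; the PRAW credentials, subreddit, and keyword in the commented stream loop are placeholders, not values from the project.

```python
import re

def clean_text(text: str) -> str:
    """Strip URLs and non-essential characters, then normalize whitespace."""
    text = re.sub(r"http\S+|www\.\S+", "", text)       # remove URLs
    text = re.sub(r"[^A-Za-z0-9\s.,!?']", " ", text)   # drop stray symbols
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace

# Streaming with PRAW (credentials and keyword are placeholders):
# import praw
# reddit = praw.Reddit(client_id="...", client_secret="...",
#                      user_agent="sentiment-app")
# for submission in reddit.subreddit("all").stream.submissions():
#     if "keyword" in submission.title.lower():
#         print(clean_text(submission.title))
```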
Real-time Data Analysis with Apache Spark:
- Applied a pre-trained sentiment analysis model using PySpark to assess the sentiment of each Reddit post in the stream, categorizing them into positive, negative, and neutral sentiments.
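The README does not name the pre-trained model, so the sketch below substitutes a toy lexicon classifier purely to illustrate the positive/negative/neutral categorization; the commented UDF wrapper shows how such a function would be applied to the streaming DataFrame in PySpark.

```python
# Toy lexicon stands in for the actual pre-trained model (illustrative only).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def classify(text: str) -> str:
    """Label text positive, negative, or neutral by lexicon hit counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Wrapping the classifier as a Spark UDF for use on the streaming DataFrame:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# sentiment_udf = udf(classify, StringType())
# df = df.withColumn("sentiment", sentiment_udf(df["body"]))
```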
Data Storage and Visualization:
- Streamed the sentiment-analyzed data back into Kafka and utilized Spark Streaming to persist the processed data into a Hive table stored in Parquet format for efficiency and performance.
- Leveraged Grafana to create a dashboard that visualizes the sentiment analysis results in real-time, providing a user-friendly interface to monitor public opinion trends.
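The sink stage above might look like the following Spark Structured Streaming sketch: read the analyzed records back out of Kafka and persist them as Parquet, which a Hive external table can then point at. The bootstrap server, topic name, and output paths are assumptions, not configuration taken from the project.

```python
# Topic name matches the execution steps below; paths are assumptions.
SOURCE_TOPIC = "sentiment_analysis"
OUTPUT_PATH = "output/sentiments_parquet"
CHECKPOINT_PATH = "output/checkpoints"

def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SentimentSink").getOrCreate()

    # Read the sentiment-analyzed records as a streaming DataFrame.
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", SOURCE_TOPIC)
          .load()
          .selectExpr("CAST(value AS STRING) AS record"))

    # Persist to Parquet; a Hive external table can be defined over OUTPUT_PATH.
    (df.writeStream
       .format("parquet")
       .option("path", OUTPUT_PATH)
       .option("checkpointLocation", CHECKPOINT_PATH)
       .start()
       .awaitTermination())

if __name__ == "__main__":
    main()
```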
Challenges and Learnings:
- Encountered and overcame technical hurdles in setting up a seamless integration between the various components of the big data stack.
- Discovered best practices for streamlining data flow, such as optimizing Kafka topic configurations for robustness against potential data spikes.
- Gained insights into the nuances of real-time data processing and the criticality of preprocessing for sentiment analysis accuracy.
Outcomes and Further Developments:
- The Real-time Sentiment Analyzer parsed Reddit's live feed, giving stakeholders a pulse on current sentiment trends for specific topics.
- The project provided valuable hands-on experience with a modern big data stack and underscored the importance of real-time analytics in understanding public sentiment.
- Potential enhancements include geographical sentiment analysis, advanced NLP for finer sentiment granularity, and a comparative analysis dashboard for historical sentiment trends.
Steps to execute:
- Create a Reddit API client and obtain its client ID and secret key
- Run the reddit_api.py script (python reddit_api.py)
- This generates file1.csv in your local path
- Next, run main.py; it inserts the CSV data into the Kafka topic (posts_topic)
- Once the topic is populated, run spark_streaming.py (python spark_streaming.py)
- This step reads data from the Kafka topic "posts_topic" and applies SentimentAnalyzer() to it
- Once analysis is done, the results are written back to the Kafka topic "sentiment_analysis"
- Finally, a Parquet file is generated locally.
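The main.py step might be sketched as below: read the generated CSV and publish one message per row to posts_topic. The kafka-python client and the bootstrap server address are assumptions; the README does not state which producer library the project uses.

```python
import csv

POSTS_TOPIC = "posts_topic"  # topic name from the steps above

def load_rows(csv_path: str):
    """Read the generated CSV and yield one encoded message per row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield ",".join(row.values()).encode("utf-8")

def publish(csv_path: str = "file1.csv"):
    # kafka-python is one common client; the actual library used by
    # main.py is an assumption.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for message in load_rows(csv_path):
        producer.send(POSTS_TOPIC, message)
    producer.flush()
```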