Objective:
- The goal was to develop an application that performs real-time sentiment analysis on Reddit posts, providing immediate insights into public opinion on various topics.
Pipeline Implementation:
Data Ingestion and Environment Setup:
- Configured a local big data environment comprising HDFS, Kafka, Zookeeper, and Hive, supplemented by cloud resources to optimize performance and scalability.
- Installed Python and PySpark, integrating them with Kafka to facilitate data streaming from Reddit.
API Integration and Data Streaming:
- Utilized PRAW (Python Reddit API Wrapper) to stream live Reddit data based on specific keywords.
- Implemented a Python script to filter and preprocess this data, ensuring cleanliness by removing URLs and non-essential content.
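A minimal sketch of the preprocessing step described above. The cleaning rules (strip URLs, drop non-essential symbols, collapse whitespace) follow the README's description; the PRAW credentials, subreddit, and keyword in the commented stream loop are placeholders, not values from the project.

```python
import re

def clean_text(text: str) -> str:
    """Strip URLs and non-essential characters, then normalize whitespace."""
    text = re.sub(r"http\S+|www\.\S+", "", text)       # remove URLs
    text = re.sub(r"[^A-Za-z0-9\s.,!?']", " ", text)   # drop stray symbols
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace

# Streaming with PRAW (credentials and keyword are placeholders):
# import praw
# reddit = praw.Reddit(client_id="...", client_secret="...",
#                      user_agent="sentiment-app")
# for submission in reddit.subreddit("all").stream.submissions():
#     if "keyword" in submission.title.lower():
#         print(clean_text(submission.title))
```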
Real-time Data Analysis with Apache Spark:
- Applied a pre-trained sentiment analysis model using PySpark to assess the sentiment of each Reddit post in the stream, categorizing them into positive, negative, and neutral sentiments.
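The README does not name the pre-trained model, so the sketch below substitutes a toy lexicon classifier purely to illustrate the positive/negative/neutral categorization; the commented UDF wrapper shows how such a function would be applied to the streaming DataFrame in PySpark.

```python
# Toy lexicon stands in for the actual pre-trained model (illustrative only).
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def classify(text: str) -> str:
    """Label text positive, negative, or neutral by lexicon hit counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Wrapping the classifier as a Spark UDF for use on the streaming DataFrame:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# sentiment_udf = udf(classify, StringType())
# df = df.withColumn("sentiment", sentiment_udf(df["body"]))
```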
Data Storage and Visualization:
- Streamed the sentiment-analyzed data back into Kafka and utilized Spark Streaming to persist the processed data into a Hive table stored in Parquet format for efficiency and performance.
- Leveraged Grafana to create a dashboard that visualizes the sentiment analysis results in real-time, providing a user-friendly interface to monitor public opinion trends.
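The sink stage above might look like the following Spark Structured Streaming sketch: read the analyzed records back out of Kafka and persist them as Parquet, which a Hive external table can then point at. The bootstrap server, topic name, and output paths are assumptions, not configuration taken from the project.

```python
# Topic name matches the execution steps below; paths are assumptions.
SOURCE_TOPIC = "sentiment_analysis"
OUTPUT_PATH = "output/sentiments_parquet"
CHECKPOINT_PATH = "output/checkpoints"

def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("SentimentSink").getOrCreate()

    # Read the sentiment-analyzed records as a streaming DataFrame.
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", SOURCE_TOPIC)
          .load()
          .selectExpr("CAST(value AS STRING) AS record"))

    # Persist to Parquet; a Hive external table can be defined over OUTPUT_PATH.
    (df.writeStream
       .format("parquet")
       .option("path", OUTPUT_PATH)
       .option("checkpointLocation", CHECKPOINT_PATH)
       .start()
       .awaitTermination())

if __name__ == "__main__":
    main()
```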
Challenges and Learnings:
- Encountered and overcame technical hurdles in setting up a seamless integration between the various components of the big data stack.
- Discovered best practices for streamlining data flow, such as optimizing Kafka topic configurations for robustness against potential data spikes.
- Gained insights into the nuances of real-time data processing and the criticality of preprocessing for sentiment analysis accuracy.
Outcomes and Further Developments:
- The Real-time Sentiment Analyzer parsed Reddit's live feed, giving stakeholders a pulse on current sentiment trends for specific topics.
- The project provided valuable hands-on experience with a modern big data stack and underscored the importance of real-time analytics in understanding public sentiment.
- Potential enhancements include geographical sentiment analysis, advanced NLP for finer sentiment granularity, and a comparative analysis dashboard for historical sentiment trends.
Steps to execute:
- Create a Reddit API client and obtain its client ID and secret key
- Run the reddit_api.py script (python reddit_api.py)
- This generates file1.csv in your local path
- Next, run main.py; it inserts the CSV data into the Kafka topic (posts_topic)
- Once the topic is populated, run spark_streaming.py (python spark_streaming.py)
- This step reads data from the Kafka topic "posts_topic" and applies SentimentAnalyzer() to it
- Once analysis is done, the results are written back to the Kafka topic "sentiment_analysis"
- Finally, a Parquet file is generated locally.
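The main.py step might be sketched as below: read the generated CSV and publish one message per row to posts_topic. The kafka-python client and the bootstrap server address are assumptions; the README does not state which producer library the project uses.

```python
import csv

POSTS_TOPIC = "posts_topic"  # topic name from the steps above

def load_rows(csv_path: str):
    """Read the generated CSV and yield one encoded message per row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield ",".join(row.values()).encode("utf-8")

def publish(csv_path: str = "file1.csv"):
    # kafka-python is one common client; the actual library used by
    # main.py is an assumption.
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for message in load_rows(csv_path):
        producer.send(POSTS_TOPIC, message)
    producer.flush()
```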