Real-Time Twitter Sentiment Analysis

Big Data Project: Real-Time Twitter Sentiment Analysis Using Kafka, Spark (MLlib & Streaming), MongoDB, and Django.

Overview

This repository contains a Big Data project focused on real-time sentiment analysis of Twitter data (tweet classification). It combines Kafka, Spark, MongoDB, and Django to collect, process, analyze, and visualize tweet sentiment in real time.

Project Architecture

The project is built using the following components:

  • Apache Kafka: Used for real-time data ingestion; rows from the Twitter dataset are replayed as live tweets.

  • Spark Streaming: Processes the streaming data from Kafka to perform sentiment analysis.

  • MongoDB: Stores the processed sentiment data.

  • Django: Serves as the web framework for building a real-time dashboard to visualize the sentiment analysis results.

  • Chart.js & Matplotlib: Used for plotting.

The project plan diagram is included in the repository.

Features

  • Real-time Data Ingestion: Streams tweets from the Twitter dataset through Kafka.
  • Stream Processing: Utilizes Spark Streaming to process and analyze the data in real-time.
  • Sentiment Analysis: Classifies tweets into different sentiment categories (positive, negative, neutral) using natural language processing (NLP) techniques.
  • Data Storage: Stores the sentiment analysis results in MongoDB for persistence.
  • Visualization: Provides a real-time dashboard built with Django to visualize the sentiment trends and insights.

Data Description

This project uses two datasets, twitter_training.csv and twitter_validation.csv, to train the PySpark model and to replay tweets through Kafka as a live stream. Each line of the twitter_training.csv training set represents one tweet; the file contains over 74,682 lines.

The features and their data types are:

  • Tweet ID: int
  • Entity: string
  • Sentiment: string (Target)
  • Tweet content: string

The validation set, twitter_validation.csv, contains 998 lines (tweets) with the same features as twitter_training.csv.
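The CSV files can be parsed directly into the four fields listed above. A minimal sketch, using synthetic rows and assuming the column order given in this section (the real files have no header row):

```python
import csv
import io

# Column order as described above; the CSV files have no header row.
COLUMNS = ["tweet_id", "entity", "sentiment", "tweet_content"]

# Two synthetic rows in the same shape as twitter_training.csv.
sample = io.StringIO(
    '2401,Borderlands,Positive,"really enjoying this game tonight"\n'
    '2402,Borderlands,Negative,"this game keeps crashing on me"\n'
)

rows = [dict(zip(COLUMNS, line)) for line in csv.reader(sample)]
print(rows[0]["sentiment"])  # Positive
print(len(rows))             # 2
```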

This is the Data Source: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis
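The notebook in `ML PySpark Model` trains the classifier on this data. The exact stages live in the notebook; as an illustrative assumption only, a typical Spark MLlib text-classification pipeline for this schema looks like:

```python
def build_sentiment_pipeline():
    """Sketch of a PySpark text-classification pipeline (an assumption --
    see the notebook in `ML PySpark Model` for the actual stages).
    Requires `pip install pyspark`; defined but not executed here."""
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
    from pyspark.ml.classification import LogisticRegression

    tokenizer = Tokenizer(inputCol="tweet_content", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="features")
    label = StringIndexer(inputCol="sentiment", outputCol="label")
    lr = LogisticRegression(maxIter=10)
    return Pipeline(stages=[tokenizer, tf, idf, label, lr])
```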

Repository Structure

  • Django-Dashboard: the Django dashboard application.
  • Kafka-PySpark: the Kafka producer and the PySpark streaming consumer.
  • ML PySpark Model: the trained model, the Jupyter notebook, and the datasets.
  • zk-single-kafka-single.yml: Docker Compose file that downloads and runs Apache Kafka in Docker.
  • bigdataproject rapport: a brief report about the project (in French).

Getting Started

Prerequisites

To run this project, you will need the following installed on your system:

  • Docker (for running Kafka)
  • Python 3.x
  • Apache Kafka
  • Apache Spark (PySpark for python)
  • MongoDB
  • Django

Installation

  1. Clone the repository:

    git clone https://github.com/drisskhattabi6/Real-Time-Twitter-Sentiment-Analysis.git
    cd Real-Time-Twitter-Sentiment-Analysis
  2. Install Docker Desktop.

  3. Set up Kafka:

    • Download and run Apache Kafka in Docker using:
    docker-compose -f zk-single-kafka-single.yml up -d
  4. Set up MongoDB:

    • Download and install MongoDB.
      • It is recommended to also install MongoDB Compass, which makes visualizing and working with MongoDB data easier.
  5. Install Python dependencies:

    • This installs PySpark, PyMongo, Django, and the other dependencies:
    pip install -r requirements.txt

Running the Project

Note: MongoDB must be running before you start the Kafka and Spark Streaming application and the Django dashboard application.

  • Start MongoDB:
    • using the command line:
    sudo systemctl start mongod
    • then open MongoDB Compass (recommended).
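A quick way to check that MongoDB is reachable from Python is a small helper like the following. The database and collection names here are illustrative assumptions, not the project's actual names:

```python
def tweets_collection(uri="mongodb://localhost:27017",
                      db_name="bigdata", coll_name="tweets"):
    """Return a handle to the MongoDB collection the pipeline writes to.
    Database/collection names are illustrative assumptions.
    Requires `pip install pymongo` and a running mongod; defined but
    not executed here."""
    from pymongo import MongoClient
    return MongoClient(uri)[db_name][coll_name]
```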

Running the Kafka and Spark Streaming application:

  1. Change the directory to the application:

    cd Kafka-PySpark
  2. Open a shell in the Kafka container (started by Docker Compose):

    • using the command line:
    docker exec -it <kafka-container-id> /bin/bash
    • or using Docker Desktop:

       (Docker Desktop screenshot in the repository.)

  3. Create and inspect the Kafka topic (inside the container):

    kafka-topics --create --topic twitter --bootstrap-server localhost:9092
    kafka-topics --describe --topic twitter --bootstrap-server localhost:9092
  4. Run the Kafka producer app:

    python producer-validation-tweets.py
  5. Run the PySpark streaming (Kafka consumer) app:

    python consumer-pyspark.py
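The producer side can be sketched as follows. This is a minimal illustration, not the actual producer-validation-tweets.py: the JSON field names and message format are assumptions, and only the pure serialization part is executed here.

```python
import json

def encode_tweet(tweet_id, entity, text):
    """Serialize one dataset row to the JSON bytes published to the topic.
    Field names are illustrative assumptions."""
    return json.dumps(
        {"tweet_id": tweet_id, "entity": entity, "tweet_content": text}
    ).encode("utf-8")

def run_producer(rows, topic="twitter", servers="localhost:9092"):
    """Replay dataset rows into Kafka. Requires `pip install kafka-python`
    and the broker started above; defined but not executed here."""
    from kafka import KafkaProducer
    producer = KafkaProducer(bootstrap_servers=servers)
    for tweet_id, entity, text in rows:
        producer.send(topic, encode_tweet(tweet_id, entity, text))
    producer.flush()

msg = encode_tweet(2401, "Borderlands", "sample tweet text")
print(json.loads(msg)["entity"])  # Borderlands
```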

(Screenshot: the Kafka and Spark Streaming application running.)

This is a screenshot of MongoDB Compass after running the Kafka and Spark Streaming application:

(MongoDB Compass screenshot.)
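The consumer side (consumer-pyspark.py) reads the `twitter` topic with Spark Structured Streaming, scores each tweet, and writes the results to MongoDB. A hedged sketch, with the streaming part defined but not executed (connector choices and options are assumptions):

```python
import json

def decode_message(value: bytes) -> dict:
    """Decode one Kafka message of the JSON form shown in the producer sketch."""
    return json.loads(value.decode("utf-8"))

def run_consumer():
    """Sketch of the streaming consumer. Requires pyspark plus the
    spark-sql-kafka connector package; defined but not executed here,
    and the details are assumptions -- see consumer-pyspark.py."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TwitterSentiment").getOrCreate()
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "twitter")
              .load()
              .selectExpr("CAST(value AS STRING) AS json"))
    # From here: parse the JSON column, apply the trained MLlib pipeline
    # model, and write each micro-batch to MongoDB (e.g. via foreachBatch
    # and pymongo).

record = decode_message(b'{"tweet_id": 1, "tweet_content": "hello"}')
print(record["tweet_content"])  # hello
```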

Running the Django Dashboard application:

  1. Change the directory to the application:

    cd Django-Dashboard
  2. Collect static files (creates the static folder):

    python manage.py collectstatic
  3. Run the Django server:

    python manage.py runserver
  4. Access the Dashboard: Open your web browser and go to http://127.0.0.1:8000 to view the real-time sentiment analysis dashboard.

Screenshots of the running dashboard are included in the repository.

More information:

  • The Django dashboard reads its data from the MongoDB database.
  • Users can classify their own text at http://127.0.0.1:8000/classify.
  • The dashboard includes a table of tweets with their labels.
  • The dashboard shows three statistics/plots: label rates, a pie plot, and a bar plot.
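The label-rate statistic can be computed from per-label counts pulled out of MongoDB. A small sketch of that step (the aggregation shown in the comment and the function name are illustrative assumptions):

```python
def label_rates(counts):
    """Turn raw sentiment-label counts into percentage rates, the kind of
    numbers the dashboard's plots display."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}

# The counts could come from a MongoDB aggregation such as:
# db.tweets.aggregate([{"$group": {"_id": "$sentiment", "count": {"$sum": 1}}}])

print(label_rates({"Positive": 7, "Negative": 2, "Neutral": 1}))
# {'Positive': 70.0, 'Negative': 20.0, 'Neutral': 10.0}
```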

Team:

Supervised by:

  • Prof. Yasyn El Yusufi

Abdelmalek Essaadi University - Faculty of Sciences and Technology of Tangier

  • Master: Artificial Intelligence and Data Science
  • Module: Big Data

By following the instructions above, you should be able to set up and run the real-time Twitter sentiment analysis project on your local machine. Happy coding!

Feel free to explore the project and customize it to your requirements. If you encounter any issues or have questions, don't hesitate to reach out!