Final-Project

Our project performs sentiment analysis on Twitter data using PySpark, combining Word2Vec embeddings, a Multilayer Perceptron classifier, Naive Bayes, and NLTK. The goal is to analyze the emotional content of tweets and gain a fuller picture of users' sentiments and attitudes.

Main.py

This Python script uses PySpark, a distributed data processing framework, to perform sentiment analysis on Twitter data. The analysis consists of preprocessing, text transformation with Word2Vec, and classification with a Multilayer Perceptron (MLP) neural network.
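To make the Word2Vec step more concrete: Spark ML's Word2Vec turns each document into a single vector by averaging the vectors of its words, and that vector is what the MLP receives as input. The snippet below is a minimal pure-Python sketch of that averaging idea, not the code in Main.py. The function name, the toy embeddings, and the handling of unknown words are all assumptions made for illustration, and Spark's own treatment of out-of-vocabulary words may differ.

```python
from typing import Dict, List


def average_vector(tokens: List[str],
                   embeddings: Dict[str, List[float]],
                   size: int) -> List[float]:
    """Average the vectors of the tokens that have an embedding.

    Simplification (assumption): unknown tokens are ignored, and a
    document with no known tokens maps to the zero vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * size
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]


# Hypothetical 2-dimensional embeddings; real ones are learned from the tweets.
toy_embeddings = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "movie": [0.0, 2.0],
}

features = average_vector(["good", "movie"], toy_embeddings, size=2)
print(features)  # [0.5, 1.0]
```

The resulting fixed-length vector is the kind of feature a classifier such as an MLP can consume, regardless of how long the original tweet was.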

NLTK tokenization.ipynb

This notebook runs sentiment analysis on Twitter data with PySpark, covering data preprocessing, text cleaning, Word2Vec embedding, and training an MLP model. Evaluation metrics computed on the testing set show how well the model classifies sentiment.
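As an illustration of the text cleaning step, here is a minimal sketch of a tweet cleaner using only the standard library. The specific rules (lowercasing, dropping URLs and @mentions, keeping only letters) are assumptions about what a typical tweet cleaner does; the notebook's actual rules may differ.

```python
import re

# Patterns are illustrative assumptions, not taken from the notebook.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, @mentions, and non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)  # also removes '#', digits, punctuation
    return " ".join(text.split())       # collapse repeated whitespace


cleaned = clean_tweet("@user I LOVE this!! http://t.co/x #happy")
print(cleaned)          # i love this happy
print(cleaned.split())  # ['i', 'love', 'this', 'happy']
```

In a PySpark job, a function like this would usually be applied to the text column before tokenization and embedding.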

NaiveBayes.py

This script preprocesses the data, applies natural language processing (NLP) techniques, and trains and evaluates a Naive Bayes classifier for sentiment analysis.
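For readers unfamiliar with the classifier, the following is a small pure-Python sketch of multinomial Naive Bayes with Laplace smoothing. It is not the implementation used in NaiveBayes.py; the function names and the four toy training examples are hypothetical and exist only to show the idea.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def train_nb(docs: List[Tuple[List[str], str]]):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(tokens: List[str], model, alpha: float = 1.0) -> str:
    """Return the class with the highest log prior plus log likelihood."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen in training
            # Laplace-smoothed word likelihood
            score += math.log((word_counts[label][tok] + alpha)
                              / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, for illustration only.
train = [
    (["love", "this"], "positive"),
    (["great", "day"], "positive"),
    (["hate", "this"], "negative"),
    (["awful", "day"], "negative"),
]
model = train_nb(train)
print(predict_nb(["love", "day"], model))  # positive
print(predict_nb(["hate"], model))         # negative
```

The smoothing term `alpha` keeps a word that never appeared in one class from driving that class's probability to zero.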

SentimentAnalysis_NonSpark.ipynb

This notebook reads the raw data, performs text preprocessing, and produces initial visualizations. It trains several classic machine learning algorithms as well as a pretrained BERT Transformer model. The prediction results are presented through a report, a confusion matrix, and a bar graph. The last image shows only part of the BERT results, because time and computing resources were limited.
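To show what the confusion matrix and accuracy figures summarize, here is a minimal standard-library sketch. The labels and predictions are made-up values for illustration, not results from the notebook, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confusion_matrix(y_true: List[str],
                     y_pred: List[str]) -> Dict[Tuple[str, str], int]:
    """Count (true label, predicted label) pairs."""
    return dict(Counter(zip(y_true, y_pred)))


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels and predictions, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]

matrix = confusion_matrix(y_true, y_pred)
for (true_label, pred_label), count in sorted(matrix.items()):
    print(f"true={true_label} predicted={pred_label}: {count}")
print(f"accuracy = {accuracy(y_true, y_pred):.2f}")  # accuracy = 0.60
```

The off-diagonal entries (true label differs from predicted label) show which kind of mistake a model makes, which a single accuracy number hides.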

Reminder

Make sure the necessary libraries and dependencies, such as PySpark and NLTK, are installed before running the scripts and notebooks.
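One way to check this up front is a small helper that reports which packages cannot be imported. This is a sketch only: the package list below is an assumption based on the tools named in this README, and the notebooks may need more than these.

```python
import importlib.util
from typing import List


def missing_packages(names: List[str]) -> List[str]:
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list, based on the tools mentioned above; extend as needed.
REQUIRED = ["pyspark", "nltk"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```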