Sentiment Analysis on Twitter Data

This project builds a machine learning model that classifies tweets as positive or negative. The model is trained on a large dataset of tweets with corresponding sentiment labels.

Dataset

This project uses the "Sentiment140" dataset, which contains 1.6 million tweets and can be downloaded from Kaggle.
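
As a rough sketch, the CSV can be loaded with pandas along the following lines. The column names, the latin-1 encoding, and the 0/4 label coding describe the commonly distributed Kaggle copy of Sentiment140 and are assumptions here. The function name load_sentiment140 is hypothetical, and the notebook may load the data differently.

```python
import pandas as pd

# Assumed column layout of the Kaggle CSV, which ships without a header row.
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]


def load_sentiment140(path):
    """Load the tweets and map the assumed 0/4 labels to 0/1."""
    df = pd.read_csv(path, names=COLUMNS, encoding="latin-1")
    # Assumption: 0 marks a negative tweet and 4 marks a positive one.
    df["target"] = df["target"].replace(4, 1)
    # Only the label and the tweet text are needed for sentiment classification.
    return df[["target", "text"]]
```

Calling load_sentiment140 with the path to the downloaded CSV returns a two-column DataFrame that the later steps can work from.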

Dependencies

The following Python libraries are required to run the code:

pandas
re (Python standard library)
pickle (Python standard library)
nltk
scikit-learn (imported as sklearn)

Approach

The main steps involved in this project are as follows:

Data Preprocessing: The dataset is loaded into a pandas DataFrame, and the text data is cleaned and preprocessed. This includes removing non-alphabetic characters, tokenizing the tweets, and stemming the words using NLTK's PorterStemmer.
Feature Extraction: The preprocessed text data is converted into numerical features using TfidfVectorizer from scikit-learn. This creates a sparse matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features.
Model Training: The dataset is split into training and testing sets (80% for training and 20% for testing). A Logistic Regression model is trained on the training data using the TF-IDF features as input and the sentiment labels as targets.
Model Evaluation: The trained model is evaluated on both the training and testing data by calculating the accuracy score.
Model Saving: The trained model is saved as a pickle file for future use.
Model Testing: The saved model is loaded and tested on new examples from the testing set to ensure it is working as expected.
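
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the function names preprocess and train_and_evaluate, the whitespace tokenizer, the max_iter, stratify, and random_state settings, and the model.pkl file name are all hypothetical. The sketch also fits the vectorizer on the training split only, which may differ from the order used in the notebook, and it pickles the vectorizer together with the model so that new tweets can be transformed the same way.

```python
import pickle
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

stemmer = PorterStemmer()


def preprocess(tweet):
    """Keep letters only, lowercase, split on whitespace, and stem each token."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", tweet)
    tokens = letters_only.lower().split()  # assumption: simple whitespace tokenizer
    return " ".join(stemmer.stem(token) for token in tokens)


def train_and_evaluate(texts, labels, model_path="model.pkl"):
    """Train on an 80/20 split, save the model, and return (train_acc, test_acc)."""
    cleaned = [preprocess(text) for text in texts]
    X_train, X_test, y_train, y_test = train_test_split(
        cleaned, labels, test_size=0.2, stratify=labels, random_state=42
    )

    # Fit TF-IDF on the training tweets only, then reuse it for the test tweets.
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_vec, y_train)

    train_acc = accuracy_score(y_train, model.predict(X_train_vec))
    test_acc = accuracy_score(y_test, model.predict(X_test_vec))

    # Save the vectorizer with the model; both are needed to score new tweets.
    with open(model_path, "wb") as f:
        pickle.dump({"vectorizer": vectorizer, "model": model}, f)

    return train_acc, test_acc
```

To check a saved model, load the pickle file, run preprocess on a tweet, transform it with the stored vectorizer, and pass the result to the stored model's predict method.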

Usage

To run the code, open the main.ipynb notebook and execute the cells in order. The notebook is self-contained and includes all the necessary code and explanations.

Results